Abstract

We study the problem of learning from unlabeled samples very general statistical mixture models on
large finite sets. Specifically, the model to be learned, $\vartheta$, is a probability distribution over probability
distributions $p$, where each such $p$ is a probability distribution over $[n] = \{1,2,\ldots,n\}$. When we sample
from $\vartheta$, we do not observe $p$ directly, but only indirectly and in very noisy fashion, by sampling from $[n]$
repeatedly, independently $K$ times from the distribution $p$. The problem is to infer $\vartheta$ to high accuracy in
transportation (earthmover) distance.

We give the first efficient algorithms for learning this mixture model without making any restricting 
assumptions on the structure of the distribution $\vartheta$. We bound the quality of the solution as a function of
the size of the samples $K$ and the number of samples used. Our model and results have applications to a
variety of unsupervised learning scenarios, including learning topic models and collaborative filtering.

1 Introduction


We study the problem of learning from unlabeled samples a statistical mixture model that is a combination of
distributions over a common large discrete domain $[n] = \{1,2,\ldots,n\}$. This is a model that has applications
to a variety of unsupervised learning scenarios, including learning topic models [26, 34] and collaborative 
filtering [27]. For instance, in the setting of topic models, we are given a corpus of documents, where each 
document is a “bag of words” (that is, each document is an unordered multiset of words). The words in a 
document reflect the topics that this document relates to. The assumption is that there is a small number 
of “pure” topics, where each topic is a distribution over the underlying vocabulary of n words, and that 
each document is some combination of topics. Specifically, a $K$-word document is generated by selecting a
“mixed” topic from a probability distribution over convex combinations of pure topics, and then sampling 
K words from this mixed topic. A good example is the so-called latent Dirichlet allocation model of [10], 
where the distribution over topic-combinations is the Dirichlet distribution. 

The mixture model. In this paper, we consider arbitrary such mixtures (of a more general form), and our
goal is to learn the mixture distribution, which could be discrete, i.e., have finite support, or continuous.
More precisely, the mixture distribution, $\vartheta$, is a probability distribution over probability distributions on $[n]$.
(Equivalently, $\vartheta$ is a distribution over the $(n-1)$-simplex $\Delta_n = \{x \in \mathbb{R}^n_{\ge 0} \mid \|x\|_1 = 1\}$.) When we draw a
sample from $\vartheta$, we obtain a distribution $p \in \Delta_n$. However, we do not observe $p$ directly, but only indirectly
and in very noisy fashion, by sampling $K$ times independently from $p$. Thus, our sample is a string of length
$K$ over the alphabet $[n]$ where each letter is an iid sample from $p$. We call such a sample a $K$-snapshot of $p$.
(A $K$-snapshot corresponds to a document of length $K$ in the topic-model example.) The problem is to learn
$\vartheta$ with high accuracy.
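
The sampling process just described can be illustrated with a small sketch. It assumes, purely for brevity, that the mixture has finite support and is given as a list of (weight, distribution) pairs; the function name `sample_snapshot` and this representation are illustrative choices, not notation from the paper.

```python
import random

def sample_snapshot(mixture, K, rng):
    """Draw one K-snapshot: pick p from the mixture, then K iid letters from p.

    `mixture` is a list of (weight, p) pairs, where p is a probability
    vector over [n] = {1, ..., n}. Finite support is an assumption made
    only to keep the sketch short.
    """
    weights = [w for w, _ in mixture]
    # Step 1: draw the hidden distribution p (never shown to the learner).
    _, p = rng.choices(mixture, weights=weights, k=1)[0]
    # Step 2: draw K iid letters from p; this string is all the learner sees.
    letters = range(1, len(p) + 1)
    return rng.choices(letters, weights=p, k=K)
```

For example, `sample_snapshot([(0.5, [1.0, 0.0, 0.0]), (0.5, [0.0, 0.5, 0.5])], 5, random.Random(0))` returns a list of five letters from {1, 2, 3}, all drawn from the same hidden $p$.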

Our mixture model is more general than that in the topic-model learning example, in that we do not
assume that $\vartheta$ is supported on the convex hull of $k$ distributions. It is an example of a statistical mixture
model, where the probability distribution from which the learning algorithm gets samples (the mixed topic 
generating a document, in our topic-model example) is a mixture of other probability distributions (pure 
topics, in our example) that are called the mixture constituents. 

Our results. We give the first efficient algorithms for learning a mixture model without placing any
restrictions on the mixture. We bound the quality of the solution as a function of the size of the samples;
clearly, larger samples give better results. A natural way to measure the accuracy of an estimate $\tilde\vartheta$ in our
general mixture model is to consider the transportation distance (aka earthmover distance) between $\tilde\vartheta$ and
$\vartheta$ (see Section 2) where the underlying metric on distributions over $[n]$ is the $L_1$ (or total variation) distance.

Given a mixture $\vartheta$ supported on a $k$-dimensional subspace, our algorithms return an estimate $\tilde\vartheta$ that is
$\epsilon$-close to $\vartheta$ in transportation distance, for any $\epsilon > 0$, using $K$-snapshot samples for $K = K(\epsilon,k)$ and
sample size that is poly($n$) and a suitable function of $k$ and $\epsilon$. (Note that the intersection of a $k$-dimensional
subspace with $\Delta_n$ could have $\exp(k)$ extreme points; so saying that $\vartheta$ lies in a $k$-dimensional subspace is
substantially weaker than assuming that $\vartheta$ is supported on the convex hull of $k$ points.) Our main result
(Theorem 5.3) is an efficient learning algorithm that uses 0{k^rfi log re/e®) 1- and 2- snapshot samples, and 
(k/e)^^^^ iT-snapshot samples, where K = = poly(A:, 1/e). We also devise algorithms with 

different tradeoffs between the sample size and the aperture, which is the maximum number of snapshots 
used per sample point (i.e., document size), for some special cases of the problem. This includes, most 
notably, the case where $\vartheta$ is a $k$-spike mixture, i.e., is supported on $k$ points in $\Delta_n$ (Theorem 6.1). This
setting has been considered previously (see below), but our algorithm is cleaner and fits into our more 
general method; and more importantly, our bounds do not depend on distribution-dependent parameters (see 
the discussion below). 

To put our bounds in perspective, first notice that we consider transportation distance with
respect to the $L_1$-metric on distributions. This yields quite strong guarantees on the quality of our
reconstruction; however, working with the $L_1$-metric (instead of $L_2$) makes the reconstruction task much harder,
both in terms of technical difficulty (see “Our techniques” below) and the sample size required: the $L_1$
distance between two distributions can be much larger than their $L_2$ distance, so it is much more demanding
to bound the $L_1$-error. In particular, this implies that the sample size must depend on $n$: as noted in [36],
with aperture independent of $n$, a sample size of $\Omega(n)$ is necessary to recover even the expectation of the
mixture distribution with constant $L_1$-error. The sample size needs to depend exponentially on the
dimension $k$ because one can have an $\exp(k)$-spike mixture $\vartheta$ (on $\Delta_n$) lying in a $k$-dimensional subspace whose
constituents are $\Omega(1)$ $L_1$-distance apart; recovering an $\epsilon$-close estimate now entails that we isolate the
locations of the spikes reasonably accurately, which necessitates $\exp(k)$ sample size. Finally, the aperture must
depend on $k$ and $\epsilon$. The dependence on $k$ is simply because our learning task is at least as hard as learning
$k$-spike mixtures for which aperture $2k - 1$ is necessary [36]. The dependence on $\epsilon$ is because the lower
bounds in [36] show that there are two (even single-dimensional) $\ell$-spike mixtures, where $\ell = \Theta(1/\epsilon)$, with
transportation distance $\Omega(\epsilon)$ that yield identical $K$-snapshot distributions for all $K < 2\ell - 1$.

A noteworthy feature of all our results is that our bounds depend only on n, k, and e. In contrast, 
all previous results for learning topic models (including those that consider only $k$-spike mixtures) obtain
bounds that depend on distribution-dependent parameters such as some measure of the separation between 
mixture constituents [34, 36], the minimum weight placed on a mixture constituent, and/or the eigenvalues 
(or singular values) of the covariance matrix (e.g., bounds on cr^, or Li-condition numbers, or the robustly 
simplicial condition) [31, 6, 4, 5]. The distribution-free nature of our bounds is clearly a desirable feature; 
if the desired accuracy is cruder than the distribution-dependent parameters, then fewer samples are needed. 

Our techniques. The main result (Theorem 5.3) is derived as follows. First, we use spectral methods to
compute from 1- and 2-snapshot samples a basis $B$ for a subspace Span($B$) of dimension at most $k$ that
nearly contains the support of $\vartheta$, and such that learning the projection $\vartheta_B$ of $\vartheta$ on Span($B$) suffices to learn
$\vartheta$ (Section 4). We need to choose $B$ carefully so as to overcome various technical challenges that arise
because we work with transportation distance in the $L_1$-metric. Specifically, we need to move between the
$L_1$ and $L_2$ metrics at various points (the rotational invariance of the $L_2$-metric makes it easier to work with),
and to avoid a $\sqrt{n}$-factor distortion due to this movement, we need to establish that an $L_1$-ball in Span($B$)
is close to being an $L_2$-ball in Span($B$) (see Lemma 4.5). This allows one to argue that: (a) $\vartheta_B$ is supported
in an $L_2$-ball of radius on the order of $1/\sqrt{n}$, which makes it feasible to learn it within $L_2$-error on the order
of $\epsilon/\sqrt{n}$ (and hence $L_1$-error $\epsilon$); and (b) projecting this reconstructed mixture to $\Delta_n$ preserves the
$L_1$-error (up to a poly($k$, $1/\epsilon$) factor).
We remark that the standard SVD technique does not suffice for our purpose, since the resulting subspace
need not satisfy the above “spherical” property of $L_1$-balls (see also the discussion in Section 4). Next, we
define a projection of the $K$-snapshot samples using $B$. We compute the estimate $\tilde\vartheta_B$ of $\vartheta_B$ by averaging
the projections and transforming the result to Span($B$) (see Section 5). The proof relies on large deviation
bounds. One can show that this projection is close to $\vartheta_B$. The output $\tilde\vartheta_B$ converges to this projection as the
number of samples grows. The rate of convergence can be bounded using tools from approximation theory.

The result for the special case of $k$-spike mixtures (i.e., $\vartheta$ is supported on $k$ distributions) uses a
three-step approach analogous to the argument in [36], but the implementation of each step is different. The first
step finds $B$ as in the general case. In the second step (Section 6.1), the algorithm projects the sample data
onto the basis vectors in $B$. From this data, the algorithm computes a good approximation to the projection
of $\vartheta$ onto each axis. The idea is to use linear programming to compute a piecewise constant discretization
of the projected measure such that the first $K$ moments are close to the empirical moments derived from the
samples of $K$-snapshots. The analysis uses a classical result in approximation theory due to Jackson that
estimates the error in approximating a 1-Lipschitz function on $[0,1]$ by the first $K$ Chebyshev polynomials.
(In fact, this step, too, does not use the special structure of the mixture. It works in the case of an arbitrary
measure $\vartheta$, and our error estimates are asymptotically optimal in general.) In the third step (Section 6.2),
we use the approximate projected measures to compute a good approximation for the projection of $\vartheta$ on
Span($B$), giving our algorithm’s output. The main idea here is similar to that of the second step. We
discretize the projection and use a linear program to compute a discretized measure whose projections onto
the axes used in the second step give a good match to the computed approximations on those axes. The
analysis of this algorithm uses Yudin’s multidimensional generalization of Jackson’s theorem [42]. Both the
second step and the third step use Kantorovich-Rubinstein duality to relate the results from approximation
theory to the approximation guarantees in terms of the transportation distance.

Related work. Generally speaking, our problem is an example of learning a mixture model. Unlike our 
case, other mixture learning problems, such as learning a mixture of Gaussians (see [18, 9, 32]), assume a 
special structure of the distributions that contribute to the mixture. We discuss this related literature below. 

A few previous papers consider the problem of learning a topic model [6, 3, 4, 36]. They all make
limiting assumptions on the structure of the mixture model. The only paper that considers an arbitrary
distribution $\vartheta$ over combinations of topics is [6]. However, this paper assumes that the pure topics are
$p$-separated, which means that each topic has an anchor word that has probability at least $p$ in this topic,
and probability 0 in any other topic. In the case of an arbitrary $\vartheta$ (over such topics), the paper [6] learns
the correlation matrix for pairs of pure topics and not $\vartheta$. In the special case of latent Dirichlet allocation,
the paper also reconstructs $\vartheta$. The latent Dirichlet allocation setting is also considered in [3]. For this
special case, they relax the condition in [6] to the requirement that the matrix whose columns are the word
distributions of the $k$ pure topics has full rank $k$. The constraints on the model that are imposed in [6, 3] allow
them to achieve their learning goals using documents of constant size that is independent of the number of
pure topics $k$ and the desired accuracy $\epsilon$. As we show in this paper, this is impossible in the general case.
The remaining two papers mentioned above [4, 36] consider only the case where each document is generated
from a single pure topic, so $\vartheta$ is a discrete distribution with support of size $k$. The first paper [4] imposes
on the pure topics the same rank condition as in [3], and thus is able to learn the model from constant
size documents. The second paper [36] studies the general pure topic documents case and shows how to
learn the model from documents of size $2k - 1$, which is a tight requirement. Notice that in this case, the
document size is independent of the desired accuracy. Our results specialized to this case are motivated
by the techniques in [36]. They give a simpler and cleaner proof that roughly matches the results there (in
particular, the mixture model is recovered using $K$-snapshots for $K = 2k - 1$, which is optimal).

Learning statistical mixture models has been studied in the theory community for about twenty years. 
The defining problem of this area was the problem of learning a mixture of high-dimensional Gaussians.
Starting with the ground-breaking result of [18], a sequence of improved results [19, 7, 40, 29, 1, 23, 12,
28, 9, 32] resolved the problem. Beyond Gaussians, various recent papers analyze learning other highly
structured mixture models (e.g., mixtures of discrete product distributions) [30, 25, 16, 8, 33, 17, 24, 29, 13,
15, 14, 20]. An important difference between this work and ours is that the structure of the mixtures that 
they discuss enables learning using samples that consist of a 1-snapshot of a random mixture constituent 
(which is impossible in our setting). Since Gaussians and other structured mixtures can be learned from 
1-snapshot samples, the issue of the samples themselves being generated from a combination of the mixture 
constituents does not arise there. Our problem is unique to learning from multi-snapshot samples. 

2 Preliminaries and notation 

Let $T : X \to Y$ be a transformation from a normed space $X$ (with norm $\|\cdot\|_X$) to a normed space $Y$
(with norm $\|\cdot\|_Y$). Let $\mu$ be a measure defined over $X$. We use $\mu \circ T^{-1}$ to denote the image measure (or
pushforward measure) defined over $Y$: $\mu \circ T^{-1}(U) = \mu(T^{-1}(U))$ for all measurable $U \subseteq Y$. It is a simple
fact (see e.g., [22]) that for any measurable function $f$,

$$\int_Y f \, d(\mu \circ T^{-1}) = \int_X f \circ T \, d\mu. \tag{1}$$


For ease of notation, we sometimes write $T\mu$ to denote the image measure $\mu \circ T^{-1}$. For a vector $v$, we
use $\|v\|$ to denote its $L_2$ norm, and for an operator $T$, we use $\|T\|_{X \to Y}$ to denote its operator norm (i.e.,
$\|T\|_{X \to Y} = \sup\{\|Tx\|_Y \mid x \in X, \|x\|_X = 1\}$). For ease of notation, we use $\|T\|$ to denote the $L_2 \to L_2$
operator norm of $T$.

Transportation Distance: Let $(X, d)$ be a separable metric space. Recall that for any two distributions $P$
and $Q$ on $X$, the transportation distance $\mathrm{Tran}(P, Q)$ (also called Rubinstein distance, Wasserstein distance
or earth mover distance in the literature) is defined as

$$\mathrm{Tran}(P, Q) = \inf\Big\{ \int_{X \times X} d(x, y) \, d\mu(x, y) \;\Big|\; \mu \in M(P, Q) \Big\}, \tag{2}$$

where $M(P, Q)$ is the set of all joint distributions (also called couplings) on $X \times X$ with marginals $P$ and
$Q$. For the discrete case (say $X$ is a finite set of discrete points $v_1, \ldots, v_n$), (2) is in fact the following
familiar transportation LP: minimize $\sum_{i,j} d(v_i, v_j) x_{ij}$ subject to $\sum_j x_{ij} = P(\{v_i\})$ for all $i \in [n]$,
$\sum_i x_{ij} = Q(\{v_j\})$ for all $j \in [n]$, and $x_{ij} \in [0,1]$ for all $i \in [n]$, $j \in [n]$. Any feasible solution
$\{x_{ij}\}_{i,j}$ of the above LP is in fact a
coupling of $P$ and $Q$, since it can be interpreted as a joint distribution over $X \times X$, and the constraints of
the LP dictate that the first marginal of $\{x_{ij}\}$ is $P$ and the second is $Q$.
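
As a small illustration of the discrete case: when the points $v_1 < \cdots < v_n$ lie on a line and $d(x,y) = |x - y|$, the optimal value of the transportation LP is known to equal the area between the two cumulative distribution functions, so no LP solver is needed. The sketch below relies on that standard one-dimensional fact; the function name `tran_1d` is an illustrative choice and does not appear in the paper.

```python
def tran_1d(points, P, Q):
    """Transportation distance between discrete P and Q on a line.

    `points` must be sorted increasingly; P[i] and Q[i] are the masses
    placed on points[i]. Uses the 1-D identity
    Tran(P, Q) = integral of |F_P(t) - F_Q(t)| dt.
    """
    total = 0.0
    cdf_gap = 0.0  # running value of F_P - F_Q just to the right of points[i]
    for i in range(len(points) - 1):
        cdf_gap += P[i] - Q[i]
        total += abs(cdf_gap) * (points[i + 1] - points[i])
    return total
```

For instance, moving a unit mass from 0 to 1 costs 1, and splitting mass 1 at the point 0.5 evenly between 0 and 1 costs 0.5.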

Suppose $\mu$ is a measure on some metric space $(X, d)$. Let $T : X \to X$ be an operator. $T$ naturally
defines a coupling $W$ between $\mu$ and the image measure $T\mu$: for any $R \subseteq X \times X$, let $W(R) = \mu(\{x \mid
(x, Tx) \in R\})$ (so for any measurable $S \subseteq X$, $W(S \times T(S)) = \mu(S)$). For ease of description, for such a
coupling, we often say “we couple $x$ with $Tx$ together”.

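Let 1-Lip be the set of 1-Lipschitz functions on $X$, i.e., 1-Lip $:= \{f : X \to \mathbb{R} \mid |f(x) - f(y)| \le
d(x, y)$ for any $x, y \in X\}$. We need the following important theorem by Kantorovich and Rubinstein (see
e.g., [22]):

$$\mathrm{Tran}(P, Q) = \sup\Big\{ \Big| \int f \, d(P - Q) \Big| \;\Big|\; f \in \text{1-Lip} \Big\}. \tag{3}$$

In the discrete case, the Kantorovich-Rubinstein theorem is exactly LP-duality (the dual of the aforementioned
LP is: maximize $\sum_i f_i \big(P(\{v_i\}) - Q(\{v_i\})\big)$ subject to $f_i - f_j \le d(v_i, v_j)$ for all $i \in [n]$, $j \in [n]$).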

It is important to notice that the transportation distance and the Lipschitz condition are associated with the
same metric $d(x, y)$. We use $\mathrm{Tran}_1$ and $\mathrm{Tran}_2$ to denote the transportation distance for the $L_1$ and $L_2$ metrics
respectively. In 1-dimensional space, $L_1$ and $L_2$ are the same and we simply use Tran. The following
simple lemma will be useful in several places. The proofs are standard; we include them in Appendix A for
completeness.

Lemma 2.1. Let $(X, \|\cdot\|_X)$ and $(Y, \|\cdot\|_Y)$ be two normed spaces. We are given two probability measures $\mu, \nu$
defined over $X$ such that $\mathrm{Tran}(\mu, \nu) \le \epsilon$.

(i) Suppose $T : X \to Y$ is a transformation from $X$ to $Y$. Then $\mathrm{Tran}(T\mu, T\nu) \le \epsilon \cdot \|T\|_{X \to Y}$.

(ii) Furthermore, if both $\mu$ and $\nu$ are supported on a subspace $V \subseteq X$, then $\mathrm{Tran}(T\mu, T\nu) \le \epsilon \cdot \|T\|_V$,
where $\|T\|_V = \sup_{x \in V} \|Tx\|_Y / \|x\|_X$.

(iii) We are given two operators $T$ and $T'$ such that $\|T - T'\|_{X \to Y} \le \epsilon$. Suppose $\|T\|_{X \to Y} = O(1)$ and
$\|x'\|_X = O(1)$ for all $x' \in \mathrm{Support}(\nu)$. Then, we have that $\mathrm{Tran}(T\mu, T'\nu) \le O(\epsilon)$.

We state the following standard Chernoff-Hoeffding bound and Bernstein inequality.

Proposition 2.2. Let $X_i$ ($1 \le i \le n$) be independent random variables with values in $[0,1]$. Let $X =
\sum_{i=1}^{n} X_i$. For every $t > 0$, we have that $\Pr[|X - E[X]| \ge t] \le 2\exp(-2t^2/n)$.

Proposition 2.3. Let $X_i$ ($1 \le i \le n$) be independent random variables with $|X_i| \le 1$, $E[X_i] = 0$ for all $i$.
Let $X = \sum_{i=1}^{n} X_i$. Let $\sigma^2 = \mathrm{Var}[X] = \sum_{i=1}^{n} \mathrm{Var}[X_i]$. Then, $\Pr[|X| > t] \le 2\exp\big(-\frac{t^2}{2(\sigma^2 + t/3)}\big)$.

We will use the following results from the matrix perturbation and random matrix theory. 

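Theorem 2.4. (Wedin’s theorem, see e.g., [38, p. 261]) Let $A, \tilde{A} \in \mathbb{R}^{m \times n}$ with $m \ge n$ be given. Let the
singular value decompositions of $A$ and $\tilde{A}$ be

$$(U_1, U_2, U_3)^T A (V_1, V_2) = \begin{pmatrix} \Sigma_1 & 0 \\ 0 & \Sigma_2 \\ 0 & 0 \end{pmatrix}, \qquad (\tilde U_1, \tilde U_2, \tilde U_3)^T \tilde A (\tilde V_1, \tilde V_2) = \begin{pmatrix} \tilde\Sigma_1 & 0 \\ 0 & \tilde\Sigma_2 \\ 0 & 0 \end{pmatrix}.$$

Let $\Phi$ be the matrix of canonical angles between Span($U_1$) and Span($\tilde U_1$) and $\Theta$ be that between Span($V_1$)
and Span($\tilde V_1$). If there exist $\delta, \alpha > 0$ such that $\min_i \sigma_i(\tilde\Sigma_1) \ge \alpha + \delta$ and $\max_i \sigma_i(\Sigma_2) \le \alpha$, then
$\max\{\|\sin\Phi\|, \|\sin\Theta\|\} \le \|\tilde A - A\| / \delta$. Moreover, $\|\Pi_{U_1} - \Pi_{\tilde U_1}\| = \|\sin\Phi\|$ (see e.g., [38, p. 43]).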

Theorem 2.5 ([41]). For every constant $c > 0$, there is a constant $C > 0$ such that the following holds. Let $A$
be a symmetric matrix with entries $a_{ij} = a_{ji} = X_{ij}$, where $X_{ij}$, $1 \le i \le j \le n$ are independent random variables.
Suppose each $X_{ij}$ is such that $|X_{ij}| \le K$, $E[X_{ij}] = 0$ and $\mathrm{Var}[X_{ij}] \le \sigma^2$ where $\sigma \ge C K \ln^2 n / \sqrt{n}$.
Then, it holds that

$$\Pr\big[\|A\| \le 2\sigma\sqrt{n} + C (K\sigma)^{1/2} n^{1/4} \ln n\big] \ge 1 - 1/n^c.$$

The Chebyshev polynomial (of the first kind) $T_n$ is defined as the polynomial satisfying $T_n(\cos(x)) =
\cos(nx)$. An equivalent recursive definition is: $T_0(x) = 1$, $T_1(x) = x$ and $T_{n+1}(x) = 2xT_n(x) - T_{n-1}(x)$.
We need the classical Jackson’s theorem (see e.g., [37]) in approximation theory (specialized to our setting)
and a multidimensional generalization of Jackson’s theorem established by Yudin [42] (Theorem 2.7).
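
The two definitions of the Chebyshev polynomials can be checked against each other numerically. The following is a minimal sketch of the three-term recurrence; the function name `chebyshev` is an illustrative choice.

```python
import math

def chebyshev(n, x):
    """Evaluate T_n(x) via T_0 = 1, T_1 = x, T_{n+1} = 2x T_n - T_{n-1}."""
    t_prev, t_cur = 1.0, x
    if n == 0:
        return t_prev
    for _ in range(n - 1):
        t_prev, t_cur = t_cur, 2 * x * t_cur - t_prev
    return t_cur
```

Evaluating `chebyshev(n, math.cos(theta))` and comparing it with `math.cos(n * theta)` for a few values of `n` and `theta` confirms that the recurrence agrees with the trigonometric definition up to floating-point error.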

Theorem 2.6 (Jackson’s Theorem). It is possible to approximate any function on $[0,1]$ in 1-Lip within
$L_\infty$ error $O(1/K)$ using Chebyshev polynomials (or equivalently trigonometric polynomials) of degree at
most $K$, i.e., there exist $t_0, \ldots, t_K$ such that $f(x) = \sum_{i=0}^{K} t_i T_i(x) \pm O(1/K)$ for all $x \in [0,1]$. Moreover,
$|t_i| \le \mathrm{poly}(K)$ for all $i = 0, \ldots, K$.

Theorem 2 . 7 . We use 62(77) to denote the sphere {x G | ||a;||2 < R}- For any function f : 62(1) —)• C 
which is 1-Lip (in L 2 distance), there exists complex numbers c(t')for t' C 62 ( 7 ?), such that |c(f')| < 
exp(0(/i))^ and for all x G 62 ( 1 ), 


f{x) 





3 Learning single-dimensional mixtures: the coin problem 

In this section, we consider the problem of learning a mixture $\vartheta$ supported on $[0,1]$, which we call the coin
problem. Using results in [36], these results carry over to the setting where $\vartheta$ is supported on a line segment
in the $(n-1)$-simplex $\Delta_n = \{x \in \mathbb{R}^n_{\ge 0} \mid \|x\|_1 = 1\}$. We first consider an arbitrary (even continuous) $\vartheta$ in
$[0,1]$; in Section 3.1, we consider the case where $\vartheta$ is a $k$-spike mixture.

*In Yudin’s theorem, $c(t')$ is in fact $\hat f(t')\lambda(t'/R)$, where $\hat f(t')$ is the Fourier coefficient of $f$,
$\lambda(x) = (\phi * \phi)(x)$, $\phi(x)$ is the first normalized eigenfunction of a PDE known as the Helmholtz equation, and $*$ is the convolution.

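Let $B_{i,K}(x) = \binom{K}{i} x^i (1-x)^{K-i}$. Let $N_K$ denote the number of $K$-snapshots we take from $\vartheta$. For
$0 \le i \le K$, define $\mathrm{fq}_i(\vartheta) := \int B_{i,K}(x)\, \vartheta(dx)$. We call $\mathrm{fq}(\vartheta) := \{\mathrm{fq}_i(\vartheta)\}_{0 \le i \le K}$ the frequency vector
corresponding to $\vartheta$. We use $\widetilde{\mathrm{fq}}_i$ to denote the fraction of sampled coins that showed “heads” exactly $i$ times
and let $\widetilde{\mathrm{fq}} := \{\widetilde{\mathrm{fq}}_i\}_{0 \le i \le K}$ be the empirical frequency vector. It is easy to see that $\mathrm{fq}(\vartheta) = E[\widetilde{\mathrm{fq}}]$. If we take
enough samples, the frequency vector corresponding to the empirical measure should be sufficiently close
to that of $\vartheta$.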
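
To make the two frequency vectors concrete, here is a minimal sketch for the special case of a finitely supported mixture, given as a list of (weight, bias) pairs. That representation and the function names are assumptions made for illustration only; the definitions in the text apply to arbitrary $\vartheta$ on $[0,1]$.

```python
import math
import random

def bernstein(i, K, x):
    """Bernstein basis polynomial: probability of exactly i heads in K tosses."""
    return math.comb(K, i) * x**i * (1 - x)**(K - i)

def frequency_vector(spikes, K):
    """Exact frequency vector of a mixture given as (weight, bias) pairs."""
    return [sum(w * bernstein(i, K, a) for w, a in spikes) for i in range(K + 1)]

def empirical_frequency_vector(spikes, K, num_coins, rng):
    """Sample coins from the mixture, toss each K times, tabulate head counts."""
    counts = [0] * (K + 1)
    weights = [w for w, _ in spikes]
    for _ in range(num_coins):
        _, bias = rng.choices(spikes, weights=weights, k=1)[0]
        heads = sum(rng.random() < bias for _toss in range(K))
        counts[heads] += 1
    return [c / num_coins for c in counts]
```

With a few thousand coins, each coordinate of the empirical vector is typically within a few hundredths of the exact one, in line with the concentration argument of this section.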

Lemma 3.1. By taking $N_K = O(\kappa^{-2} \log(K/\delta))$ samples, with probability at least $1 - \delta$, we have that
$\|\mathrm{fq}(\vartheta) - \widetilde{\mathrm{fq}}\|_\infty \le \kappa$.

Proof. Using the Chernoff bound (Proposition 2.2), we can see that $\Pr[|\mathrm{fq}_i(\vartheta) - \widetilde{\mathrm{fq}}_i| > \kappa] \le
2\exp(-2\kappa^2 N_K) \le \delta/K$. Then the lemma follows from a simple application of the union bound over all $K + 1$
coordinates. ■
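
For a sense of scale, the Hoeffding tail $2\exp(-2\kappa^2 N)$ can be solved for $N$ explicitly. The sketch below is only an illustration: it splits the failure probability as $\delta/(K+1)$ per coordinate, so that the union bound over all $K+1$ coordinates totals exactly $\delta$. This is a slightly different constant from the $\delta/K$ used in the proof, and the function name is hypothetical.

```python
import math

def samples_needed(kappa, K, delta):
    """Smallest N with 2*exp(-2*kappa^2*N) <= delta/(K+1)."""
    return math.ceil(math.log(2 * (K + 1) / delta) / (2 * kappa ** 2))
```

The count grows like $\kappa^{-2}$ and only logarithmically in $K$ and $1/\delta$, which is the shape of the bound in the lemma.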


Theorem 3.2. There exists an algorithm, with running time polynomial in $K$, that gets as input $m =
\mathrm{poly}(K)$ coins from a mixture $\vartheta$, each tossed $K$ times, and outputs a mixture $\tilde\vartheta$ such that $\mathrm{Tran}(\vartheta, \tilde\vartheta) \le
O(1/\sqrt{K})$ with high probability.

Theorem 3.2 can be proved by a simple application of the Chernoff bound (where we set $\tilde\vartheta(i/K) = \widetilde{\mathrm{fq}}_i$),
which we omit here. We provide an alternative proof based on Bernstein polynomials later. It is a natural
question to ask whether $O(1/\sqrt{K})$ in Theorem 3.2 achieves the optimal aperture-transportation distance
tradeoff. In [36], it is shown that recovering a $K$-spike mixture within transportation distance $O(1/K)$
using $c(2K - 1)$ (for any constant $c > 1$) aperture requires $\exp(\Omega(K))$ samples. The following theorem
provides a matching upper bound.


Theorem 3.3. There exists an algorithm, with running time polynomial in $K$, that gets as input $m =
\exp(O(K))$ coins from a mixture $\vartheta$, each tossed $K$ times, and outputs a mixture $\tilde\vartheta$ such that $\mathrm{Tran}(\vartheta, \tilde\vartheta) \le
O(1/K)$ with high probability.

To prove Theorem 3.3, we make a crucial observation (Lemma 3.4) that links the transportation distance,
the frequency vector and the coefficients of Bernstein polynomial approximation. Lemma 3.6 bounds these
coefficients using the relation between the Bernstein polynomial basis and the Chebyshev polynomial basis. We
then provide a simple LP-based algorithm to reconstruct $\vartheta$.

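Lemma 3.4. Suppose for any $f \in$ 1-Lip$[0,1]$, there exist $K + 1$ real numbers $c_0, \ldots, c_K \in [-C, C]$, for
some value $C > 0$ and $\lambda > 0$, such that $f = \sum_{i=0}^{K} c_i B_{i,K} \pm O(\lambda)$. Then for any two distributions $P$ and $Q$
on $[0,1]$, $\mathrm{Tran}(P, Q) \le C \cdot \|\mathrm{fq}(P) - \mathrm{fq}(Q)\|_1 + O(\lambda)$.

Proof. We have $\mathrm{fq}_i(P) = \int B_{i,K} \, dP$. For any $f \in$ 1-Lip such that $f(x) \in [0,1]$ for all $x \in [0,1]$, we have

$$\Big| \int f \, d(P - Q) \Big| \le \Big| \int \sum_{i=0}^{K} c_i B_{i,K} \, d(P - Q) \Big| + O(\lambda) = \Big| \sum_{i=0}^{K} c_i \big(\mathrm{fq}_i(P) - \mathrm{fq}_i(Q)\big) \Big| + O(\lambda) \le C \cdot \|\mathrm{fq}(P) - \mathrm{fq}(Q)\|_1 + O(\lambda).$$

Taking the supremum over $f$ on both sides of the above inequality yields the lemma.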


Lemma 3.5. For any function $f \in$ 1-Lip$[0,1]$, there exist $K + 1$ real numbers $c_0, \ldots, c_K \in [-C, C]$ with
$C = O(1)$ such that $f(x) = \sum_{i=0}^{K} c_i B_{i,K}(x) \pm O(1/\sqrt{K})$ for all $x \in [0,1]$.

Proof. Let $B_K f(x) = \sum_{i=0}^{K} f(i/K) B_{i,K}(x)$ be the Bernstein polynomial approximation of $f$. It is known
that $B_K f$ converges to $f$ uniformly with the following rate for $f \in$ 1-Lip$[0,1]$: $\|B_K f - f\|_\infty \le O(1/\sqrt{K})$
(see e.g., [37]). ■
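
The $1/\sqrt{K}$ rate can be observed numerically. The sketch below evaluates the Bernstein approximation of the 1-Lipschitz function $f(x) = |x - 1/2|$ and measures its error on a grid; the test function, the grid size and the function names are illustrative choices. For any 1-Lipschitz $f$, Jensen's inequality gives $|B_K f(x) - f(x)| \le \sqrt{x(1-x)/K} \le 1/(2\sqrt{K})$, which is the bound the sketch can be compared against.

```python
import math

def bernstein_approx(f, K, x):
    """Bernstein polynomial B_K f(x) = sum_i f(i/K) * C(K,i) x^i (1-x)^(K-i)."""
    return sum(f(i / K) * math.comb(K, i) * x**i * (1 - x)**(K - i)
               for i in range(K + 1))

def sup_error(f, K, grid=200):
    """Largest |B_K f - f| over an evenly spaced grid on [0, 1]."""
    return max(abs(bernstein_approx(f, K, j / grid) - f(j / grid))
               for j in range(grid + 1))
```

Here the coefficients are simply the values $f(i/K)$, which are bounded, matching $C = O(1)$ in the lemma; quadrupling $K$ roughly halves the error for this particular $f$.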

Lemma 3.6. For any function $f \in$ 1-Lip$[0,1]$, there exist $K + 1$ real numbers $c_0, \ldots, c_K \in [-C, C]$ with
$C = \mathrm{poly}(K) \cdot 2^K$ such that $f(x) = \sum_{i=0}^{K} c_i B_{i,K}(x) \pm O(1/K)$ for all $x \in [0,1]$.

Proof. By Jackson’s theorem (see Theorem 2.6) in approximation theory, for any function $f \in$ 1-Lip$[0,1]$,
there exist $t_0, \ldots, t_K$ (with $|t_i| \le \mathrm{poly}(K)$) such that $f(x) = \sum_{i=0}^{K} t_i T_i(x) \pm O(1/K)$ for all $x \in [0,1]$, where
the $T_i$s are Chebyshev polynomials of degrees at most $K$. Since $\{B_{i,K}\}_{i=0}^{K}$ and $\{T_i\}_{i=0}^{K}$ are two different
bases of the linear space of all polynomials of degree at most $K$, there is a linear transformation $M$ that can
change from one basis to the other: for an arbitrary polynomial $P(x)$ of degree at most $K$, we can write
$P(x) = \sum_{i=0}^{K} c_i B_{i,K}(x) = \sum_{i=0}^{K} t_i T_i(x)$, where $c_i = \sum_{k=0}^{K} M_{ik} t_k$. Using $t = (t_0, \ldots, t_K)^T$ and $c =
(c_0, \ldots, c_K)^T$, we have that $c = Mt$. It is known that for all $i, j$, $|M_{ij}| = \frac{(2K-1)!!}{(2i-1)!!\,(2K-2i-1)!!}$
where $n!! = n(n-2)(n-4)\cdots(4)(2)$ for even $n$ and $n!! = n(n-2)(n-4)\cdots(3)(1)$ for odd $n$ [35].
Hence, we have that

$$\|c\|_\infty \le \max_{i,j} |M_{ij}| \cdot \sum_{i=0}^{K} |t_i| \le \mathrm{poly}(K) \cdot 2^K.$$

This implies that for any $f \in$ 1-Lip, we can also get $c_i$s with $|c_i| \le \mathrm{poly}(K) 2^K$ such that $f(x) =
\sum_{i=0}^{K} t_i T_i(x) \pm O(1/K) = \sum_{i=0}^{K} c_i B_{i,K}(x) \pm O(1/K)$ for all $x \in [0,1]$. ■


Reconstructing $\vartheta$. Suppose we have a good empirical frequency vector $\widetilde{\mathrm{fq}}$ which satisfies $\|\widetilde{\mathrm{fq}} - \mathrm{fq}(\vartheta)\|_1 \le
\lambda/C$, where $\lambda$ and $C$ are as in Lemma 3.4. Now, we show how to reconstruct the mixture approximately.
We propose a simple LP-based algorithm as follows.

We approximate each $B_{i,K}$ by a piecewise constant function $\bar{B}_{i,K}$ in $[0,1]$ such that $\|B_{i,K} - \bar{B}_{i,K}\|_\infty \le
\epsilon'$ for $\epsilon' = O(\kappa)$ ($\kappa$ in Lemma 3.1). It is easy to see that $O(1/\epsilon')$ pieces suffice (since $B_{i,K}$ is either monotone
or unimodal). We can divide $[0,1]$ into $h = O(K/\epsilon')$ small intervals $[a_0 = 0, a_1), [a_1, a_2), \ldots, [a_{h-1}, a_h =
1]$ such that in each small interval $\bar{B}_{i,K}$ is a constant for all $0 \le i \le K$. We use $b_{ij}$ to denote the value
of $\bar{B}_{i,K}$ in interval $[a_j, a_{j+1})$. For each small interval $[a_j, a_{j+1})$, define a variable $z_j$ (think of $z_j$ as the
approximation of $\vartheta([a_j, a_{j+1}))$). Consider the following linear program LP:

$$z \ge 0 \quad \text{and} \quad \sum_{j=0}^{h-1} z_j = 1 \quad \text{and} \quad \sum_{j=0}^{h-1} b_{ij} z_j = \widetilde{\mathrm{fq}}_i \pm \epsilon', \ \text{for } i = 0, \ldots, K. \tag{4}$$

It is easy to see that, by Lemma 3.1, $z_j = \vartheta([a_j, a_{j+1}))$ defined by the original mixture measure $\vartheta$ is a
feasible solution for LP.

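On the other hand, any feasible solution of LP produces a frequency vector that is close to $\widetilde{\mathrm{fq}}$: Suppose
$z^*$ is an arbitrary feasible solution of LP and $\tilde\vartheta$ is any distribution supported on $[0,1]$ that is consistent with
$z^*$ (i.e., $\tilde\vartheta([a_j, a_{j+1})) = z_j^*$). Then we have that

$$\mathrm{fq}_i(\tilde\vartheta) = \int B_{i,K} \, d\tilde\vartheta = \pm\epsilon' + \int \bar{B}_{i,K} \, d\tilde\vartheta = \pm\epsilon' + \sum_j b_{ij} \int_{[a_j, a_{j+1})} d\tilde\vartheta = \pm\epsilon' + \sum_j b_{ij} z_j^* = \widetilde{\mathrm{fq}}_i \pm 2\epsilon'.$$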

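Proof of Theorem 3.3. Combining with Lemma 3.1, we have that

$$\|\mathrm{fq}(\tilde\vartheta) - \mathrm{fq}(\vartheta)\|_1 \le K \|\mathrm{fq}(\tilde\vartheta) - \mathrm{fq}(\vartheta)\|_\infty \le K\big(\|\mathrm{fq}(\tilde\vartheta) - \widetilde{\mathrm{fq}}\|_\infty + \|\widetilde{\mathrm{fq}} - \mathrm{fq}(\vartheta)\|_\infty\big) \le O(K\kappa).$$

Then, using Lemma 3.1 with $2^{O(K)}$ samples, we can make $\|\mathrm{fq}(\vartheta) - \mathrm{fq}(\tilde\vartheta)\|_1 \le 1/(CK)$ (recall that $C =
\mathrm{poly}(K) 2^K$). So, we finally have that

$$\mathrm{Tran}(\vartheta, \tilde\vartheta) \le C \|\mathrm{fq}(\vartheta) - \mathrm{fq}(\tilde\vartheta)\|_1 + O(1/K) \le O(1/K) \quad \text{for } \kappa = O(1/(CK^2)).$$ ■

Proof of Theorem 3.2. The proof is the same as that of Theorem 3.3, except that we use Lemma 3.5 instead.
In this case, it suffices to use only $\mathrm{poly}(K)$ samples to ensure that $\|\mathrm{fq}(\vartheta) - \mathrm{fq}(\tilde\vartheta)\|_1 \le O(1/K)$. ■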


3.1 Learning $k$-spike mixtures

We now consider the case where $\vartheta$ is a $k$-spike mixture supported in $[0,1]$, i.e., $\vartheta$ is supported on $k$ points
in $[0,1]$. This result will be useful later when we consider mixtures in higher dimensions. We now use
$K$-snapshots only for $K = 2k - 1$. Let the $i$-th moment of $\vartheta$ be $g_i(\vartheta) = \int x^i \vartheta(dx) = \sum_{j=1}^{k} p_j \alpha_j^i$ (writing $p_j$
for the weight of the spike at $\alpha_j$). The
algorithm is based on an identifiability lemma proved in [36] (Lemma 3.7) and its converse (Lemma 3.8).

Lemma 3.7 ([36]). For any two k-spike distributions supported on [0,1], — 5 '(i? 2)||2 > 

^ Tran(i9i ,i92) ^ 

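Lemma 3.8. For any two distributions $\vartheta_1, \vartheta_2$ supported on $[0,1]$, and $i \in [K]$, $|g_i(\vartheta_1) - g_i(\vartheta_2)| \le i \cdot
\mathrm{Tran}(\vartheta_1, \vartheta_2)$.

Proof. For any $i \in [K]$, it is easy to see that $x^i$ is $i$-Lipschitz in $[0,1]$. Hence, we have

$$|g_i(\vartheta_1) - g_i(\vartheta_2)| = \Big| \int x^i \, d(\vartheta_1 - \vartheta_2) \Big| \le i \cdot \mathrm{Tran}(\vartheta_1, \vartheta_2).$$

The last inequality is due to the Kantorovich-Rubinstein theorem. ■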

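Recall the frequency vector $\mathrm{fq}_i(\vartheta) = \int \binom{K}{i} x^i (1-x)^{K-i} \vartheta(dx) = \sum_{j=1}^{k} p_j \binom{K}{i} \alpha_j^i (1-\alpha_j)^{K-i}$. Define
the normalized frequency vector to be $\mathrm{nfq}_i(\vartheta) = \int x^i (1-x)^{K-i} \vartheta(dx) = \sum_{j=1}^{k} p_j \alpha_j^i (1-\alpha_j)^{K-i}$. Let Pas
be the $2k \times 2k$ lower triangular Pascal triangle matrix with non-zero entries $\mathrm{Pas}_{ij} = \binom{K-i}{j-i}$ for $0 \le i \le K$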
and i < j < K. \t is nof difficull to verify fhaf g{'&) = Pas nfq(z9) . If is brown fhaf ||Pas|[^< ^ jsf^. By 
Lemma 3.1, using 0{{kj samples, fhe empirical frequency vector fq satisfies fhaf jjfq — fq(z ?)||2 < 
wifh probabilify 0.99. Lef nfqj = fq/(^^). Lef f = Pas nfq be fhe empirical momenf vector. 
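
The relation between moments and normalized frequencies follows from the binomial theorem: $x^i = x^i (x + (1-x))^{K-i} = \sum_{j=i}^{K} \binom{K-i}{j-i} x^j (1-x)^{K-j}$. The sketch below checks this identity exactly, using rational arithmetic, for a small spike mixture given as (weight, location) pairs. The function names and the example mixture are illustrative, and the binomial entries are the ones implied by the identity above rather than a quotation of the paper's matrix.

```python
from fractions import Fraction
from math import comb

def moments(spikes, K):
    """g_i = sum_j p_j * a_j^i for i = 0..K."""
    return [sum(p * a**i for p, a in spikes) for i in range(K + 1)]

def normalized_frequency(spikes, K):
    """nfq_i = sum_j p_j * a_j^i * (1 - a_j)^(K - i) for i = 0..K."""
    return [sum(p * a**i * (1 - a)**(K - i) for p, a in spikes)
            for i in range(K + 1)]

def moments_from_nfq(nfq):
    """Apply the Pascal-type matrix: g_i = sum_{j >= i} C(K-i, j-i) * nfq_j."""
    K = len(nfq) - 1
    return [sum(comb(K - i, j - i) * nfq[j] for j in range(i, K + 1))
            for i in range(K + 1)]
```

With $k = 2$ spikes and $K = 2k - 1 = 3$, the vector has $K + 1 = 2k$ entries, and `moments_from_nfq(normalized_frequency(spikes, 3))` reproduces `moments(spikes, 3)` exactly when the inputs are `Fraction` values.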

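If we can find a distribution $\tilde\vartheta$ such that $\|g(\tilde\vartheta) - g(\vartheta)\|_2 \le (\epsilon/k)^{O(k)}$, we know, by Lemma 3.7, that
$\mathrm{Tran}(\tilde\vartheta, \vartheta) \le \epsilon$. In order to find such a $\tilde\vartheta$, we do the following. $\tilde\vartheta$ is a $k$-spike distribution supported on
the set of discrete points $\{0, \tau, 2\tau, \ldots, 1\}$ where $\tau = (\epsilon/k)^{O(k)}$. First, we guess the support of $\tilde\vartheta$ (there are
at most $\binom{1/\tau + 1}{k}$ choices). Then, we solve the following linear program LP$_1$, where $x_j$ represents the probability mass
placed at point $j\tau \in \mathrm{Support}(\tilde\vartheta)$: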

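$$\text{LP}_1: \quad \Big| \sum_j x_j (j\tau)^i - \tilde g_i \Big| \le O(K\tau) \ \text{ for all } i \in [K], \qquad \sum_j x_j = 1, \qquad x_j \in [0,1] \ \text{ for all } j.$$
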

Theorem 3.9. Using $(k/\epsilon)^{O(k)} \log(1/\delta)$ many $(2k-1)$-snapshot samples, the above algorithm can produce
an estimation $\tilde\vartheta$, which satisfies that $\mathrm{Tran}(\tilde\vartheta, \vartheta) \le \epsilon$ with probability $1 - \delta$.

Proof. We know fhere is a /c-spike measure z?' supporfed on {0, r, 2r,... , 1} such fhaf Tran(z?, z9') < r. 
Hence, \gi{'d') — gi{'&)\ < ix for all z, by Lemma 3.8. Also, 

e\n{k) 

k) 


\\g- 9{'&)\\2 < ||Pas||||nfq - nfq(z9)||2 < ||Pas|| jjfq - fq(z ?)||2 < 


(5) 






Therefore, we have 


- 9i\ < \gi{'&') - 9i{'&)\ + \gi{'&) - 9i\< 0{iT). 

This indicates that LPi has a feasible solution. -!? is a feasible solution of LPi, hence || 5 (i?) — 'g \\2 < 
So, 

lls-W - fi-Wlb < Ilfi-W - gh + lls-W - gh < 0{K^/'^t) < , 

which implies the theorem, by Lemma 3.7. ■ 

4 Learning multidimensional mixtures on $\Delta_n$: a reduction

We now consider the setting where the mixture $\vartheta$ (on $\Delta_n$) is an arbitrary distribution supported in a
$k$-dimensional subspace in $\mathbb{R}^n$. In this section, we use $\mathrm{Tran}_1$ and $\mathrm{Tran}_2$ to denote the transportation distances
measured in $L_1$ and $L_2$ norm respectively. For a point $v$ and a set $S$, we use $\Pi_S(v)$ to denote the projection
of $v$ to $S$, i.e., the point in $S$ that is closest to $v$. We always assume the projection is with respect to $L_2$
distance, unless specified otherwise. For any arbitrary measure $\vartheta$ supported on $\mathbb{R}^n$, we use $\Pi_S(\vartheta)$ to denote
the projected measure defined as $\Pi_S(\vartheta)(T) = \vartheta(\Pi_S^{-1}(T))$ for any measurable $T \subseteq S$.

This section provides a reduction from the original learning problem to the problem of learning the
projected measure in a specific subspace Span($B$). Sections 5 and 6 complement this reduction by devising
algorithms for learning the projected measure $\vartheta_B := \Pi_{\mathrm{Span}(B)}(\vartheta)$ (for arbitrary $k$-dimensional $\vartheta$ and
$k$-spike $\vartheta$ respectively); combining these algorithms with the reduction of this section yields algorithms for
learning $\vartheta$. The space Span($B$) will satisfy several useful properties (Lemma 4.5). One particularly useful
property is that any unit vector $v \in \mathrm{Span}(B)$ has $\|v\|_\infty \le O(1/\sqrt{n})$ (ignoring factors depending on $\epsilon$ and $k$).
This implies that the $L_1$ norm and $L_2$ norm in Span($B$) are quite close up to scaling, which allows us to convert
bounds between $L_1$ and $L_2$ distances without losing a factor depending on $n$ (otherwise, we typically lose
a factor of $\sqrt{n}$). Furthermore, we can show that we do not lose too much by working in Span($B$), as most of
the mass of $\vartheta$ is very close to Span($B$). Suppose we can learn the projected measure $\vartheta_B$ well. If we can
show that $\vartheta_B$ is close to the original mixture $\vartheta$ in $\mathrm{Tran}_1$ distance, then a good estimation of $\vartheta_B$ would be
a good estimation of $\vartheta$ as well. However, we are not able to show that $\vartheta_B$ and $\vartheta$ are close enough in general.
Nevertheless, we can prove that a projection of $\vartheta_B$ to a smaller polytope is close to $\vartheta$. Finally, we need to
make some small adjustments in order to ensure that our estimation $\tilde\vartheta$ is a valid mixture, as well as a good
approximation of $\vartheta$ (see Reduction 1).

Before we delve into the details of our reduction, we provide some intuition for why we require the subspace $\mathrm{Span}(B)$ to satisfy the above-mentioned properties and why the standard SVD method does not suffice. For ease of discussion, we treat $\epsilon$ and $k$ as constants, but $n$ as a parameter that can be very large. Our goal is to obtain $\mathrm{Span}(B)$ of dimension at most $k$ so that if we can learn the projected mixture $\vartheta_B$ within $\mathrm{Tran}_1$-error at most $\epsilon_1$, then we can learn $\vartheta$ within $\mathrm{Tran}_1$-distance at most $\epsilon$. We would like $\epsilon_1$ to be independent of $n$ so that the number of $K$-snapshot samples required to estimate $\vartheta_B$ within $\mathrm{Tran}_1$-distance at most $\epsilon_1$ is independent of $n$ (as is the case in Theorems 5.3 and 6.1).

Suppose first that we know $A$ exactly, and we simply use $\mathrm{Span}(A)$ as the subspace. In fact, it is not difficult to learn $\vartheta = \vartheta_A$ within $L_2$-transportation distance $\epsilon_1$ using a sample size independent of $n$. This is mainly due to the rotationally-invariant nature of $L_2$, which makes this equivalent to a learning problem in $\mathbb{R}^k$.

However, the same is not true for the $L_1$ distance. Note that we place no assumptions on $A$, so in order to obtain an estimate $\tilde\vartheta$ with $\mathrm{Tran}_1(\tilde\vartheta, \vartheta) \le \epsilon_1$, we essentially need to ensure that $\mathrm{Tran}_2(\tilde\vartheta, \vartheta) \le \epsilon_1/\sqrt{n}$; however, this would require a sample size depending on $n$. It is precisely to prevent this $\sqrt{n}$-factor loss that we require that an $L_2$-ball in our subspace $\mathrm{Span}(B)$ be close to an $L_\infty$-ball (and hence, an $L_1$-ball is "nearly spherical"). This ensures that $\vartheta_B$ is supported in an $L_2$-ball of radius $L = O(1/\sqrt{n})$, which makes it possible to learn $\vartheta_B$ within $\mathrm{Tran}_2$-distance $\epsilon_1/\sqrt{n}$ with sample size independent of $n$, since the desired error is $O(L)$. The standard SVD method would typically return the subspace spanned by the first few eigenvectors of $A$; but this suffers from the same problem as when we know $A$ exactly, since there is no guarantee that an $L_2$-ball in this subspace is close to an $L_\infty$-ball in this subspace.

We now state the main result of this section. We use the following parameters throughout the paper. The polynomial in the definition of $C$ below depends on the specific problems and we will instantiate it later.

C = poly(fc.i), L = = (6) 

Theorem 4.1. Suppose $\vartheta$ is an arbitrary mixture on $\mathrm{Span}(A) \cap \Delta_n$, where $\mathrm{Span}(A)$ is a $k$-dimensional subspace. We can find a subspace $\mathrm{Span}(B)$ of dimension $h$ ($h \le k$) in polytime such that:

(i) $\mathrm{Span}(B)$ satisfies all properties stated in Lemma 4.5 (see below); and

(ii) If we can learn an approximation $\tilde\vartheta_B$ (supported on $\mathrm{Span}(B)$) for the projected measure $\vartheta_B = \Pi_{\mathrm{Span}(B)}(\vartheta)$ such that $\mathrm{Tran}_1(\vartheta_B, \tilde\vartheta_B) \le \epsilon_1$ using $N_1(n)$, $N_2(n)$ and $N_K(n)$ 1-, 2-, and $K$-snapshot
samples, then we can learn a mixture 4) such that Trani(r?, r?) < e using 0{Ni{n/e) + nlogn/e^), 
0{N2(n/e) + Ofk^Tif logn/e®)) and 0{NK{n/e)) 1-, 2-, and K-snapshot samples respectively. 

The reduction and its analysis. Let $r$ be the vector encoding the 1-snapshot distribution of $\vartheta$, i.e., $r_i = \Pr[\text{the 1-snapshot sample is } i] = \int x_i\, \vartheta(dx)$. We say that the mixture $\vartheta$ is isotropic if $r_i \in [1/2n, 2/n]$ for all $i \in [n]$. Using $O(n \log n)$ 1-snapshot samples, we can get sufficiently accurate estimates of the $r_i$'s with high probability.
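
To make the isotropy condition concrete, the following minimal sketch estimates the 1-snapshot distribution by empirical frequencies and checks whether every entry lies in $[1/2n, 2/n]$. The function names and the 0-based letter indexing are assumptions made for this illustration, and the sketch does not model the sample-size requirement of Lemma 4.2.

```python
from collections import Counter


def estimate_one_snapshot_distribution(samples, n):
    """Empirical frequency of each letter 0..n-1 among 1-snapshot samples."""
    counts = Counter(samples)
    total = len(samples)
    return [counts[i] / total for i in range(n)]


def is_isotropic(r, n):
    """Check that every r_i lies in the interval [1/(2n), 2/n]."""
    return all(1.0 / (2 * n) <= r_i <= 2.0 / n for r_i in r)


# Toy data (hypothetical): four 1-snapshot samples over n = 3 letters.
r_hat = estimate_one_snapshot_distribution([0, 0, 1, 2], n=3)
print(r_hat, is_isotropic(r_hat, n=3))
```
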

Lemma 4.2 ([36]). For every a > Q, we can use 0(^n log n) independent 1-snapshot samples to get ri 
such that, with probability at least 1 — 1 /n^, for all i G [n], 

Tj G (1 ± a)ri Mi with > (T/2n, < (1 -|- a)a j2n Mi with < a jin. 

Next, we show it is without loss of generality to assume that the given mixture is isotropic, at the expense 
of a small additive error. The argument essentially follows that of [36], but is simpler. 

Lemma 4.3. Suppose we can learn with probability $1 - \delta$ an isotropic mixture on $[n]$ within $L_1$ transportation distance $\epsilon$ using $N_1(n)$, $N_2(n)$ and $N_K(n)$ 1-, 2-, and $K$-snapshot samples respectively. Then we can learn, with probability $1 - O(\delta)$, an arbitrary mixture within $L_1$ transportation distance $2\epsilon$ using
0{-^nlogn -\- Nx{n/a)),0{N2{n/a)) and 0{NK{n/a)) 1-, 2-, and K-snapshot samples respectively, 
where a < e/4. 

From now on, we assume that the given mixture $\vartheta$ is isotropic. Let $A$ be the $n \times n$ symmetric matrix encoding the 2-snapshot distribution of $\vartheta$; i.e., $A_{ij}$ is the probability of obtaining a 2-snapshot $(i,j)$. It is easy to see that $A = \int x x^T \vartheta(dx)$. Note that the support $\mathrm{Support}(\vartheta)$ of the mixture $\vartheta$ is contained in the subspace, $\mathrm{Span}(A)$, spanned by the columns of $A$. For ease of exposition, we first assume that we know $A$ exactly. This assumption can be dropped via somewhat standard matrix perturbation arguments, which we sketch at the end of this section. Consider the hypercube $\mathcal{H} = [-C/n, C/n]^n$ in $\mathbb{R}^n$ ($C$ only depends on $k$ and $\epsilon$, and is fixed later). We now have all the notation to give a detailed description of the reduction.
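
As a small illustration of the 2-snapshot matrix, the sketch below computes $\int x x^T \vartheta(dx)$ for a mixture with finitely many spikes, where the integral reduces to a weighted sum. The function name and the toy mixture are assumptions made for this example. Because every spike sums to 1, each row of the matrix sums to the corresponding 1-snapshot probability.

```python
def second_moment_matrix(weights, spikes):
    """Return A with A[i][j] = sum_m weights[m] * spikes[m][i] * spikes[m][j]."""
    n = len(spikes[0])
    A = [[0.0] * n for _ in range(n)]
    for w, x in zip(weights, spikes):
        for i in range(n):
            for j in range(n):
                A[i][j] += w * x[i] * x[j]
    return A


# Hypothetical 2-spike mixture over n = 3 letters.
weights = [0.4, 0.6]
spikes = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5]]
A = second_moment_matrix(weights, spikes)
row_sums = [sum(row) for row in A]  # matches the 1-snapshot distribution r
```
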
Reduction 1.


Constructing the basis B. Input: Matrix A. Output: A basis B satisfying Lemma 4.5. 

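Consider the centrally symmetric polytope $\mathcal{P} = \mathcal{H} \cap \mathrm{Span}(A)$ and the John ellipsoid $\mathcal{E}$ inscribed in $\mathcal{P}$. It is well known that $\mathcal{E} \subseteq \mathcal{P} \subseteq \sqrt{k}\,\mathcal{E}$. Suppose the principal axes of $\sqrt{k}\,\mathcal{E}$ are $\{e_1, \ldots, e_k\}$, sorted in nondecreasing order of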

their lengths. We choose the orthonormal basis to be '■ W^iWi > For every G B, it is easy 

to see that ||6,||oo = ^ <^-^ = 0 f ^. 


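Final adjustment. Input: Matrix $B$, $\tilde\vartheta_B$ (which is an approximation of $\vartheta_B$ and supported on $\mathrm{Span}(B)$).

Output: The final estimate $\tilde\vartheta$ of the original mixture $\vartheta$.

1. Define the polytope $Q = (\Delta_n + B_1^n(\epsilon)) \cap \mathrm{Span}(B)$. Here $B_1^n(\epsilon)$ denotes the $L_1$-ball in $\mathbb{R}^n$ with radius $\epsilon$, and the Minkowski sum $A + B$ of sets $A$ and $B$ is the set $\{a + b \mid a \in A, b \in B\}$. Essentially, $Q$ is the set of points in $\mathrm{Span}(B)$ with $L_1$ norm within $[1 - \epsilon, 1 + \epsilon]$.

2. Let $\tilde\vartheta_Q = \Pi_Q(\tilde\vartheta_B)$ be the measure $\tilde\vartheta_B$ projected to $Q$, i.e., $\tilde\vartheta_Q(S) = \tilde\vartheta_B(\Pi_Q^{-1}(S))$ for any $S \subseteq Q$.

3. Notice that $\tilde\vartheta_Q$ may not be a valid mixture since some points in the support of $\tilde\vartheta_Q$ may not be in $\Delta_n$. In this final step, we $L_1$-project $\tilde\vartheta_Q$ back into $\Delta_n$ and obtain a valid mixture $\tilde\vartheta$ (i.e., for each point in $Q$, we map it to its $L_1$-closest point in $\Delta_n$), which is our final estimate of $\vartheta$.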
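
Step 3 needs, for each support point, some $L_1$-closest point of the simplex. One way to compute such a point is sketched below, under the assumption that any $L_1$-closest point is acceptable (the minimizer is generally not unique). The sketch zeroes out negative coordinates and then either rescales the surplus mass or adds the missing mass to a single coordinate; both cases attain the lower bound $\sum_i \max(-p_i, 0) + |1 - \sum_i \max(p_i, 0)|$ on the $L_1$ distance. The function names are hypothetical.

```python
def l1_project_to_simplex(p):
    """Return one L1-closest point of the probability simplex to p."""
    pos = [max(v, 0.0) for v in p]  # negative coordinates must rise to 0
    s = sum(pos)
    if s >= 1.0:
        # Too much mass: shrink every coordinate proportionally.
        return [v / s for v in pos]
    # Too little mass: add the deficit to the largest coordinate of p.
    q = pos[:]
    j = max(range(len(p)), key=lambda i: p[i])
    q[j] += 1.0 - s
    return q


def l1_distance(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))
```

For example, `l1_project_to_simplex([0.2, 0.3, -0.1])` returns a point of the simplex at $L_1$ distance 0.6 from the input.
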


Lemma 4.4 shows that for large enough $C$, $\mathcal{H}$ contains $(1 - \epsilon)$ unit of mass of $\vartheta$. Lemma 4.5 proves various properties about $\mathrm{Span}(B)$, which we exploit to prove that the final adjustment procedure returns a good estimate of $\vartheta$.

Lemma 4.4. For any $\epsilon > 0$, the following hold. (i) Suppose $\vartheta$ is a $k$-spike distribution. For $C \ge 3k/\epsilon$, $\vartheta(\mathcal{H}) \ge 1 - \epsilon$. (ii) Suppose $\vartheta$ is an arbitrary distribution supported in a $k$-dimensional subspace. For $C \ge 6k^2/\epsilon$, $\vartheta(\mathcal{H}) \ge 1 - \epsilon$.

Proof. We prove the first statement. Suppose $\vartheta = \sum_{i=1}^k p_i \delta_{a_i}$, where $\delta_{a_i}$ is the Dirac delta at point $a_i$. We use $a_{ij}$ to denote the $j$th coordinate of $a_i$. Since $\vartheta$ is isotropic, we know that $\sum_{i=1}^k p_i a_{ij} = r_j \in [1/2n, 2/n]$. So, if $a_{ij} > C/n$ for some $j$ (or equivalently $a_i \notin \mathcal{H}$), we have $p_i < 2/C$. The lemma thus follows since there can be at most $k$ such points.
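
Spelled out, the counting step reads as follows. Here $p_i$ is the weight of spike $a_i$, $r_j$ is the 1-snapshot probability of letter $j$, and $\mathcal{H}$ is the hypercube $[-C/n, C/n]^n$; the first inequality uses that all spike coordinates are nonnegative because the spikes lie in the simplex. This is only a restatement of the argument above, showing that any $C \ge 2k/\epsilon$ already suffices for this step.

```latex
p_i\, a_{ij} \;\le\; \sum_{i'=1}^{k} p_{i'}\, a_{i'j} \;=\; r_j \;\le\; \frac{2}{n},
\qquad
a_{ij} > \frac{C}{n} \;\Longrightarrow\; p_i < \frac{2/n}{C/n} = \frac{2}{C},
\qquad
\vartheta\bigl(\mathbb{R}^n \setminus \mathcal{H}\bigr) \;\le\; k \cdot \frac{2}{C} \;\le\; \epsilon
\quad \text{whenever } C \ge \frac{2k}{\epsilon}.
```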

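To show the second statement, consider two convex polytopes

$$\mathcal{P}_1 = \mathrm{Span}(A) \cap \tfrac{1}{k}\mathcal{H} \quad \text{and} \quad \mathcal{P}_2 = \mathrm{Span}(A) \cap \mathcal{H},$$

where $\tfrac{1}{k}\mathcal{H} = [-C/kn, C/kn]^n$. Both $\mathcal{P}_1$ and $\mathcal{P}_2$ are symmetric $k$-dimensional bodies. By a classical result from convex geometry we can find a linear transformation $\mathcal{K}$ of the unit hypercube $[-1, +1]^k$, such that $\mathcal{K} \subseteq \mathrm{Span}(A)$ and

$$\mathcal{P}_1 \subseteq \mathcal{K} \subseteq k\mathcal{P}_1 = \mathcal{P}_2.$$

Now, we confine ourselves to $\mathrm{Span}(A)$. $\mathcal{K}$ has $2k$ faces of codimension 1. For each such face $F$, consider the polyhedron

$$C_F = \{x \mid x = \alpha y, \text{ for some } \alpha \ge 1 \text{ and } y \in F\}.$$

In other words, $F$ separates the cone generated by $F$ into two parts and $C_F$ is the unbounded part. We claim that $\vartheta(C_F) \le 2k/C$ for any face $F$. Consider the normalized vector $r_F = \int_{C_F} x\, \vartheta(dx) / \vartheta(C_F)$. Since $r_F$ is a convex combination of vectors in $C_F$ and $C_F$ is convex, $r_F$ is in $C_F$. Moreover, it is easy to see that $\mathcal{P}_1 \cap C_F = \emptyset$.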

^This can be seen either from John's theorem, or the fact that the Banach-Mazur distance between any two norms in $\mathbb{R}^k$ is at most $k$ (see, e.g., [39]).
So there must be a coordinate of $r_F$ whose value is larger than $C/nk$. Since $r = \int x\, \vartheta(dx) \ge \vartheta(C_F)\, r_F$, we must have $\vartheta(C_F) \le 2k/C$. All such $C_F$'s together fully cover the region outside $\mathcal{P}_2$, and there are at most $2k$ such $C_F$'s. So the total mass outside $\mathcal{P}_2$ is at most $4k^2/C$. ■

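Lemma 4.5. Let $L = O(\sqrt{k/n} \cdot C/\epsilon)$. Let $\mathcal{P} = \mathrm{Span}(A) \cap \mathcal{H}$. Let $v \in \mathrm{Span}(B)$. The following hold.

(i) If $\|v\|_2 = 1$ then $\|v\|_\infty \le L$. (ii) If $\|v\|_1 = 1$ then $1/\sqrt{n} \le \|v\|_2 \le L$.

(iii) If $x \in \mathbb{R}^n$ with $\|x\|_1 = 1$, then $\|\Pi_B(x)\|_2 \le L$. (iv) For every point $w \in \mathcal{P}$, $\|w - \Pi_B(w)\|_2 \le \epsilon/\sqrt{n}$.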

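Proof. Suppose $|B| = h$. Consider the ellipsoid $\mathcal{E}_B = \sqrt{k}\,\mathcal{E} \cap \mathrm{Span}(B)$. Clearly, the principal axes of $\mathcal{E}_B$ are $e_1, \ldots, e_h$. Suppose $u$ is an arbitrary point on the boundary of $\mathcal{E}_B$ and $v = u/\|u\|_2$ is a unit vector in $\mathrm{Span}(B)$. Obviously, $\|u\|_\infty \le C\sqrt{k}/n$ (as $u \in \sqrt{k}\,\mathcal{E} \subseteq \sqrt{k}\,\mathcal{H}$) and $\|u\|_2 \ge \epsilon/\sqrt{n}$. Hence,

$\|v\|_\infty = \|u\|_\infty / \|u\|_2 \le L$, which proves part (i).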

Now we show part (ii). The first inequality, $1/\sqrt{n} \le \|v\|_2$, is always true. To see the second inequality, we use the Hölder inequality together with part (i):

$$\|v\|_2^2 = \langle v, v \rangle \le \|v\|_1 \|v\|_\infty = \|v\|_\infty \le L \|v\|_2.$$

To prove part (iii), use the Hölder inequality again:

$$\|\Pi_B(x)\|_2 = \frac{\langle x, \Pi_B(x) \rangle}{\|\Pi_B(x)\|_2} \le \frac{\|x\|_1 \|\Pi_B(x)\|_\infty}{\|\Pi_B(x)\|_2} \le L.$$

For part (iv), consider an arbitrary point $w \in \mathcal{P} = \mathrm{Span}(A) \cap \mathcal{H}$. We can see that $w \in \sqrt{k}\,\mathcal{E}$. By the construction of $B$, any point in $\sqrt{k}\,\mathcal{E}$ has an $L_2$ distance at most $\|e_{h+1}\|_2$ from $\mathrm{Span}(B)$, and so does $w$. ■
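
The norm conversions in parts (ii) and (iii) can be checked numerically on a toy subspace. The sketch below is only an illustration under stated assumptions: it uses a hand-picked orthonormal basis (not the basis produced by Reduction 1), and it takes as its constant the largest row norm of the basis matrix, which by Cauchy-Schwarz equals the largest $\|v\|_\infty$ over unit vectors $v$ in the span. All function names are hypothetical.

```python
import math


def max_row_norm(B):
    """Largest L2 norm of a row of B (B is a list of n rows of length h)."""
    return max(math.sqrt(sum(c * c for c in row)) for row in B)


def project_onto_span(B, x):
    """Orthogonal projection B B^T x, assuming B has orthonormal columns."""
    n, h = len(B), len(B[0])
    coeffs = [sum(B[i][j] * x[i] for i in range(n)) for j in range(h)]
    return [sum(B[i][j] * coeffs[j] for j in range(h)) for i in range(n)]


def norm1(v):
    return sum(abs(c) for c in v)


def norm2(v):
    return math.sqrt(sum(c * c for c in v))


# Toy orthonormal basis of a 2-dimensional subspace of R^4.
B = [[0.5, 0.5], [0.5, -0.5], [0.5, 0.5], [0.5, -0.5]]
L_star = max_row_norm(B)
x = [1.0, 0.0, 0.0, 0.0]        # a point with L1 norm 1
proj = project_onto_span(B, x)  # its projection onto the span
```

In this toy case the projection of `x` has $L_2$ norm exactly `L_star`, so the bound of part (iii) is tight for this basis.
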

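We now prove part (ii) of Theorem 4.1. Let $\tilde\vartheta_B$ supported on $\mathrm{Span}(B)$ be such that $\mathrm{Tran}_1(\vartheta_B, \tilde\vartheta_B) \le \epsilon_1$. Define $\vartheta_Q = \Pi_Q(\vartheta)$ to be the original measure $\vartheta$ projected to $Q$.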

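Lemma 4.6. We have that $\mathrm{Tran}_1(\vartheta_Q, \vartheta) \le O(\epsilon)$.

Proof. For any measure $\mu$ and subset $S \subseteq \mathbb{R}^n$, let $\mu|_S$ be the measure $\mu$ restricted to $S$. It is easy to see that

$$\mathrm{Tran}_1(\vartheta, \vartheta_Q) \le \mathrm{Tran}_1(\vartheta|_{\mathcal{H}}, \Pi_Q(\vartheta|_{\mathcal{H}})) + \mathrm{Tran}_1(\vartheta|_{\bar{\mathcal{H}}}, \Pi_Q(\vartheta|_{\bar{\mathcal{H}}}))$$

where $\mathcal{H} = [-C/n, C/n]^n$ (the hypercube used in Lemma 4.4) and $\bar{\mathcal{H}}$ denotes its complement. Note that even though the transportation distance is measured in $L_1$, the projection is with respect to $L_2$ distance in this lemma. We first bound the term $\mathrm{Tran}_1(\vartheta|_{\mathcal{H}}, \Pi_Q(\vartheta|_{\mathcal{H}}))$ by coupling every point $p \in \mathrm{Support}(\vartheta|_{\mathcal{H}})$ and $\Pi_Q(p)$ together. By Lemma 4.5 (iv), the $L_2$ distance from every point in $\mathrm{Span}(A) \cap \Delta_n \cap \mathcal{H}$ to $\mathrm{Span}(B)$ is at most $\epsilon/\sqrt{n}$. Hence, $\|p - \Pi_B(p)\|_1 \le \sqrt{n}\, \|p - \Pi_B(p)\|_2 \le \epsilon$ and $\|\Pi_B(p)\|_1 \le \|p\|_1 + \|p - \Pi_B(p)\|_1 \le 1 + \epsilon$, which implies $\Pi_Q(p) = \Pi_B(p)$. Thus the first term is at most $\epsilon$.

Now, we bound the second term. For any point $p \in \Delta_n$, it is easy to see that the $L_1$ distance from $p$ to $\Pi_Q(p)$ is at most $2 + \epsilon$. Since the total mass in $\vartheta|_{\bar{\mathcal{H}}}$ is at most $\epsilon$, $\mathrm{Tran}_1(\vartheta|_{\bar{\mathcal{H}}}, \Pi_Q(\vartheta|_{\bar{\mathcal{H}}}))$ is at most $(2 + \epsilon)\epsilon \le 3\epsilon$. ■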


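Lemma 4.7. Let $\epsilon_1$ be as defined in (6). Let $\tilde\vartheta_Q$ be as defined in Reduction 1 and suppose $\tilde\vartheta_B$ is such that $\mathrm{Tran}_1(\vartheta_B, \tilde\vartheta_B) \le \epsilon_1$. Then, it holds that $\mathrm{Tran}_1(\vartheta_Q, \tilde\vartheta_Q) \le O(\epsilon)$.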


Note that even if two measures are not probability measures, their transportation distance is still well defined as long as both have the same total mass.
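Proof. First, we notice that $\vartheta_Q = \Pi_Q(\vartheta) = \Pi_Q(\Pi_{\mathrm{Span}(B)}(\vartheta)) = \Pi_Q(\vartheta_B)$. So, we have

$$\mathrm{Tran}_2(\vartheta_Q, \tilde\vartheta_Q) = \mathrm{Tran}_2(\Pi_Q(\vartheta_B), \Pi_Q(\tilde\vartheta_B)) \le \mathrm{Tran}_2(\vartheta_B, \tilde\vartheta_B),$$

where the last inequality holds since $L_2$-projection to a convex set is a contraction and by Lemma 2.1 (i). By Lemma 4.5 (ii),

$$\mathrm{Tran}_1(\vartheta_Q, \tilde\vartheta_Q) \le \sqrt{n}\, \mathrm{Tran}_2(\vartheta_Q, \tilde\vartheta_Q) \le \sqrt{n}\, \mathrm{Tran}_2(\vartheta_B, \tilde\vartheta_B) \le \sqrt{n} \cdot L \cdot \mathrm{Tran}_1(\vartheta_B, \tilde\vartheta_B).$$

Plugging in the value $L = O(\sqrt{k/n} \cdot C/\epsilon)$, we prove the lemma.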
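
For orientation, the arithmetic behind the last step can be written out as below. The exact definition of $\epsilon_1$ in (6) is not reproduced here; the computation only indicates the order of magnitude that $\epsilon_1$ must have for the bound to come out as $O(\epsilon)$, and should be read as a sketch under that assumption.

```latex
\sqrt{n} \cdot L \cdot \epsilon_1
  \;=\; \sqrt{n} \cdot O\!\left(\sqrt{\tfrac{k}{n}} \cdot \tfrac{C}{\epsilon}\right) \cdot \epsilon_1
  \;=\; O\!\left(\frac{\sqrt{k}\, C\, \epsilon_1}{\epsilon}\right),
\qquad \text{which is } O(\epsilon) \text{ when } \epsilon_1 = O\!\left(\frac{\epsilon^2}{C \sqrt{k}}\right).
```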


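Proof of part (ii) of Theorem 4.1. By Lemmas 4.6 and 4.7, we have $\mathrm{Tran}_1(\vartheta, \tilde\vartheta_Q) \le \mathrm{Tran}_1(\vartheta, \vartheta_Q) + \mathrm{Tran}_1(\vartheta_Q, \tilde\vartheta_Q) \le O(\epsilon)$. By considering the coupling between all points in $Q$ and the corresponding points in $\mathrm{Support}(\tilde\vartheta)$, we can see that $\tilde\vartheta$ is the probability measure supported in $\Delta_n$ that has the closest $L_1$-transportation distance to $\tilde\vartheta_Q$. Hence, $\mathrm{Tran}_1(\tilde\vartheta, \tilde\vartheta_Q) \le \mathrm{Tran}_1(\vartheta, \tilde\vartheta_Q) \le O(\epsilon)$. We conclude the proof by noting that $\mathrm{Tran}_1(\vartheta, \tilde\vartheta) \le \mathrm{Tran}_1(\vartheta, \tilde\vartheta_Q) + \mathrm{Tran}_1(\tilde\vartheta_Q, \tilde\vartheta) \le O(\epsilon)$. ■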

A is unknown. We now remove the assumption that A is known. First, we obtain a close approximation 
of A using 0(k^n^ log n/e®) 2-snapshot samples as follows. We choose a Poisson random variable N 2 with 
E[A 2 ] = O(k^n^ log n/e^), choose N 2 independent 2 -snapshots, and construct a symmetric n x n matrix 
$\hat A$ where $\hat A_{ii}$ is the frequency of the 2-snapshot $(i,i)$, for all $i \in [n]$, and $\hat A_{ij} = \hat A_{ji}$ is half of the total frequency of the 2-snapshots $(i,j)$ and $(j,i)$, for all $i \ne j$.
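
The construction of the symmetrized empirical matrix can be sketched as follows. The function name is hypothetical, "frequency" is interpreted as count divided by the number of 2-snapshots drawn, and the Poisson draw of the sample count is assumed to have been done by the caller.

```python
def empirical_two_snapshot_matrix(snapshots, n):
    """Symmetrized empirical frequency matrix of ordered 2-snapshots (i, j)."""
    total = len(snapshots)
    A_hat = [[0.0] * n for _ in range(n)]
    for i, j in snapshots:
        if i == j:
            A_hat[i][i] += 1.0 / total
        else:
            # Each off-diagonal observation is split evenly between (i, j)
            # and (j, i), so A_hat[i][j] is half the combined frequency.
            A_hat[i][j] += 0.5 / total
            A_hat[j][i] += 0.5 / total
    return A_hat
```

The entries of the result always sum to 1, and the matrix is symmetric by construction.
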

Lemma 4.8. The matrix A obtained above with E[A 2 ] = 0( ^ satisfies ||A — A\\ < O ^ ^ 2 ^ 3 / 2 ^- 

We find the basis $B$ as described in Reduction 1, except that we use $\hat A$ instead of $A$. Since $B$ satisfies all properties in Lemma 4.5, the algorithms and analysis in Sections 5, 6.1 and 6.2 continue to work. Suppose that we have an estimate $\tilde\vartheta_B$ of $\vartheta_B = \Pi_B(\vartheta)$ such that $\mathrm{Tran}_1(\vartheta_B, \tilde\vartheta_B) \le \epsilon_1$. We project $\tilde\vartheta_B$ to $Q = (1 + \epsilon)\Delta_n \cap \mathrm{Span}(B)$ to obtain $\tilde\vartheta_Q$. The same proof as that of Lemma 4.7 shows that $\mathrm{Tran}_1(\vartheta_Q, \tilde\vartheta_Q) \le O(\epsilon)$. So the only remaining task is to prove an analogue of Lemma 4.6 showing that $\vartheta_Q$ is close to the original mixture $\vartheta$.

Lemma 4.9. We have that $\mathrm{Tran}_1(\vartheta_Q, \vartheta) \le O(\epsilon)$.

5 Learning arbitrary mixtures in a $k$-dimensional subspace

Suppose that $\vartheta$ is an arbitrary distribution supported on a $k$-dimensional subspace $\mathrm{Span}(A)$ in $\mathbb{R}^n$. It is known that in order to learn $\vartheta$ within transportation distance $\epsilon$, it is necessary to use $K$-snapshot samples with $K = \Omega(1/\epsilon)$ [36], even in the 1-dimensional case. In this section, we generalize the result to higher dimensions. By the reduction in Theorem 4.1, we only need to specify how to learn a good approximation $\tilde\vartheta_B$ of $\vartheta_B$ such that $\mathrm{Tran}_1(\vartheta_B, \tilde\vartheta_B) \le \epsilon_1$. This can be done as follows. $B = \{b_1, \ldots, b_h\}$ is an $n \times h$ matrix (recall that $B$ is an orthonormal basis for $\mathrm{Span}(B)$). Let $b^1, \ldots, b^n$ be the columns of $B^T$. We use the following parameters in this section: $C = O(k^2/\epsilon)$ as suggested in Lemma 4.4, $\epsilon_1$ and $L$ are as in (6), and



and 



(7)


This may not be true for $L_1$ projections.
Suppose we take a $K$-snapshot sample $s = \{i_1, \ldots, i_K\}$ from $\vartheta$, where $i_j \in [n]$ for $j = 1, \ldots, K$. Let $\hat\mu(s) = \frac{1}{K} \sum_{j=1}^{K} b^{i_j}$ (which is an $h$-vector). Suppose we have $N$ $K$-snapshot samples $\{s_1, \ldots, s_N\}$. We define the empirical measure $\hat\mu = \frac{1}{N} \sum_{i=1}^{N} \delta(\hat\mu(s_i))$, where $\delta(\cdot)$ is the Dirac delta measure. Our estimate for $\vartheta_B$ is the image measure $\tilde\vartheta_B = B\hat\mu = \frac{1}{N} \sum_{i=1}^{N} \delta(B\hat\mu(s_i))$. Note that $\tilde\vartheta_B$ is indeed a discrete measure supported on $\mathbb{R}^n$, as $B\hat\mu(s_i)$ is an $n$-vector. We can also see that $\hat\mu = B^T \tilde\vartheta_B$ since $B^T B = I$.
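
The estimator described above can be sketched in a few lines: each $K$-snapshot is mapped to the average of the rows of $B$ indexed by its letters (an $h$-vector), and that point is then lifted back to $\mathbb{R}^n$ by multiplying with $B$. The function names, the 0-based letter indexing, and the toy basis in the usage example are assumptions made for this illustration; every returned support point carries mass $1/N$.

```python
def snapshot_to_point(snapshot, B):
    """Average of the rows of B indexed by the letters of one K-snapshot."""
    K, h = len(snapshot), len(B[0])
    return [sum(B[i][c] for i in snapshot) / K for c in range(h)]


def lift_to_ambient(point, B):
    """Map an h-vector of coefficients to the n-vector B @ point."""
    return [sum(row[c] * point[c] for c in range(len(point))) for row in B]


def estimate_projected_measure(snapshots, B):
    """Support points of the estimate; each one carries mass 1/N."""
    return [lift_to_ambient(snapshot_to_point(s, B), B) for s in snapshots]
```

For instance, with the toy basis `B = [[0.5, 0.5], [0.5, -0.5], [0.5, 0.5], [0.5, -0.5]]`, the snapshot `[0, 1]` is mapped to the coefficient vector `[0.5, 0.0]`.
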


Analysis. First, we define $\mu$ to be the measure $\vartheta_B$ represented in basis $B$. Hence, $\mu$ is supported over $\mathbb{R}^h$. Formally, $\mu = B^T \vartheta_B = B^T B B^T \vartheta = B^T \vartheta$. Now, we show that $\hat\mu$ is a good estimate of $\mu$. For this purpose, we introduce an intermediate measure $\mu_N$ defined as follows: Suppose the $K$-snapshot sample $s_i$ is obtained from distribution $S_i \in \mathrm{Span}(A) \cap \Delta_n$. Note that $S_i$ is an $n$-vector and let $\vartheta_N = \frac{1}{N} \sum_{i=1}^{N} \delta(S_i)$ and $\mu_N = B^T \vartheta_N$. First, we show that $\mu_N$ and $\hat\mu$ are close.

Lemma 5.1. Let ^n ttnd J1 be defined as above and K = 0{\ log ^). Then, Tran 2 (//Ar, Jl) < 0{e2L). 

Proof. We simply couple $B^T S_i \in \mathrm{Support}(\mu_N)$ and $\hat\mu(s_i) \in \mathrm{Support}(\hat\mu)$ together. Conditioning on $S_i$, we can see that $\mathrm{E}[\hat\mu(s_i)] = B^T S_i$. Recall from Lemma 4.5 that the magnitude of every entry of $B$ is at most $L$. By a standard application of the Chernoff-Hoeffding bound and a union bound over $h$ coordinates, we can see that $\Pr[\|\hat\mu(s_i) - B^T S_i\|_\infty > \epsilon_2 L/\sqrt{h}] \le \epsilon_2/2$. Hence, with high probability, for at least $(1 - \epsilon_2)N$ samples $s_i$, we have $\|\hat\mu(s_i) - B^T S_i\|_2 \le \epsilon_2 L$. Moreover, $\|\hat\mu(s_i) - B^T S_i\|_2 \le O(L\sqrt{h})$ for all $i$. So, $\mathrm{Tran}_2(\mu_N, \hat\mu) \le (1 - \epsilon_2) \cdot \epsilon_2 L + \epsilon_2 \cdot O(L\sqrt{h}) \le O(\epsilon_2 L)$. ■

Lemma 5.2. Let /r and hn be defined as above and N = 0(l/€2)^. Then, with probability at least 1 — € 2 , 
it holds that Tran2(rt) Tn) < 0 (£ 2 f")- 


Proof. $\mu_N$ is the empirical measure of $\mu$. It is well known that $\mu_N \to \mu$ almost surely in the topology of weak convergence. In particular, the rate of convergence, in terms of transportation distance, can be bounded
as follows [2, 43]: for any £ 2 , for A > C for some large constant C depending only on 62 , with probability 
at least 1 — 62 , we have Txaii 2 {pN■, p) < O Plugging A = 0 (l/e 2 )^ yields the result. ■ 

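Combining Lemmas 5.1 and 5.2, we obtain $\mathrm{Tran}_2(\hat\mu, \mu) = \mathrm{Tran}_2(B^T \tilde\vartheta_B, B^T \vartheta_B) \le O(\epsilon_2 L)$. Viewing $B$ as an operator from $L_2(\mathbb{R}^h)$ to $L_1(\mathbb{R}^n)$, its operator norm is

$$\|B\|_{2 \to 1} = \sup_{x \in \mathbb{R}^h} \frac{\|Bx\|_1}{\|x\|_2} = \sup_{x \in \mathbb{R}^h} \frac{\|Bx\|_1}{\|Bx\|_2} \le \sqrt{n}.$$

So by Lemma 2.1, $\mathrm{Tran}_1(\tilde\vartheta_B, \vartheta_B) = \mathrm{Tran}_1(B\hat\mu, B\mu) \le \|B\|_{2 \to 1}\, \mathrm{Tran}_2(\hat\mu, \mu) \le O(\epsilon_2 L \sqrt{n}) \le \epsilon_1$.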

Combining with Theorem 4.1, we obtain the following theorem for learning an arbitrary (even continuous) $k$-dimensional mixture. The sample size bounds for 1- and 2-snapshots below follow from Lemma 4.2 (taking $\sigma = O(\epsilon)$) and Lemma 4.8.

Theorem 5.3. Let dbe a mixture supported on Span(A) n A„, where Span(A) is a k-dimensional subspace. 
Using 0(n log re/e^), 0(A:^n^ log n/e®), and 2-, and K-snapshot samples respectively, where 

K = 0{k^^/e^^), we can obtain, with probability 0.99, a mixture 0 such that Trani( 7 ?, 0) < 0(e) 


6 Learning $k$-spike mixtures on $\Delta_n$

In this section, we consider the setting where $\vartheta$ is a $k$-spike distribution on $\Delta_n$, that is, $\vartheta$ is supported on $k$ points in $\Delta_n$. This setting was also considered in [36] but unlike the results therein, our sample size bounds only depend on $n$ and $k$ and not on any "width" parameters of $\vartheta$ (e.g., the least weight of a mixture constituent, or the distance between two spikes). We use $K$-snapshot samples only for $K = 2k - 1$ in this section, which is known to be necessary [36].

The high-level idea of our algorithm is as follows. Again, given the reduction of Section 4, we only need to provide an algorithm for learning a good approximation $\tilde\vartheta_B$ for the projected measure $\vartheta_B := \Pi_{\mathrm{Span}(B)}(\vartheta)$. More specifically, we need $\mathrm{Tran}_1(\vartheta_B, \tilde\vartheta_B) \le \epsilon_1$. For this purpose, we pick a fine net of directions in $\mathrm{Span}(B)$ and learn the 1-dimensional projected measures on these directions. Then we use the 1-dimensional projected measures to reconstruct $\Pi_{\mathrm{Span}(B)}\vartheta$. The reconstruction can be done by a linear program that is similar to LP$_1$ in Section 3.1. The most crucial and technically challenging part is to show that if the 1D-projections of two measures are close (in Tran), then the two measures must be close as well (Lemma 6.3). To do this, we leverage Yudin's theorem (Theorem 2.7), which shows that any 1-Lip function $f$ in $B_2(1)$ admits a good approximation in terms of certain 1D-functions with bounded Lipschitz constant. Since the 1D-projections of the two measures are close, the Kantorovich-Rubinstein theorem implies that the RHS of (3) is small for these 1D functions, and hence that the RHS of (3) is small for $f$. This implies (again by (3)) that the two measures are close in Tran.

Theorem 6.1. Let {) be an arbitrary k-spike mixture in A„. Using 0(nlogn/e^), 0{k^v? logn/e®), and 
(k/e)^^^ ) 1- and 2- and {2k — l)-snapshot samples respectively, we can obtain, with probability 0.99, a 
mixture 0 such that Trani(??, 0) < 0 (e). 

6.1 Projecting to one dimension 

Assume $B = \{b_1, \ldots, b_h\}$, where $h = \dim(\mathrm{Span}(B)) \le k$. We use the following parameters: $C = O(k/\epsilon)$ as suggested in Lemma 4.4, $\epsilon_1$ and $L$ are defined as in (6), and

K = 2k+ 1, R = 0 , €2 = ef (8) 

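Let $T$ be a set of $n$-dimensional vectors (we call them directions) in $\mathrm{Span}(B)$, where each $t \in T$ is given by $t = \sum_{i=1}^{h} t_i b_i$ with $t_i \in \frac{1}{Rh} \cdot \{-R, \ldots, R\}$. In other words, each direction $t = (t_1, \ldots, t_h) \in T$ has the form $\frac{1}{Rh} \cdot \{-R, \ldots, R\}^h$ in basis $B$. It is easy to see that for any $t \in T$, $\|t\|_2 \le 1$. Consider the set of 1-dimensional "projected" measures $\{\vartheta_t\}_{t \in T}$ where $\vartheta_t$ is defined as

$$\vartheta_t(S) := \vartheta(\{x \mid \langle t, x \rangle \in S\}) \quad \text{for any } S \subseteq \mathbb{R}.$$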

Now, we show how to estimate the projected measure $\vartheta_t$ for each $t \in T$. Since $\|x\|_1 = 1$ for any $x \in \mathrm{Support}(\vartheta)$, we can see that $\vartheta_t$ is supported within $[-\|t\|_\infty, \|t\|_\infty]$. Let $\phi(x) = \frac{x}{2\|t\|_\infty} + \frac{1}{2}$, which maps $[-\|t\|_\infty, \|t\|_\infty]$ to $[0, 1]$. Suppose we get a $K$-snapshot sample from the original mixture. We need to describe how to convert this sample to a $K$-snapshot sample for the 1-dimensional problem for estimating $\vartheta_t$.

1. For each sampled letter in the $K$-snapshot sample, say the letter is $i \in [n]$, we get a sample "1" for the 1-d problem with probability $\phi(t_i)$ ($t_i$ is the $i$th coordinate of $t$), and a sample "0" with probability $1 - \phi(t_i)$.

2. We feed those $K$-snapshot samples to the algorithm for the 1-d problem (see Section 3.1) and obtain a measure $\tilde\vartheta'_t$. Our estimate for $\vartheta_t$ is $\tilde\vartheta_t$ defined as $\tilde\vartheta_t(S) = \tilde\vartheta'_t(\phi(S))$ for any $S \subseteq [-\|t\|_\infty, \|t\|_\infty]$.
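
Step 1 of the conversion can be sketched as follows: each letter of a snapshot is replaced by a biased coin flip whose bias is the affine rescaling of the corresponding coordinate of the direction vector into $[0, 1]$. The function names, the 0-based letter indexing, and the explicit random-number-generator argument are assumptions made for this illustration; the direction vector is assumed to be nonzero.

```python
import random


def phi(x, t_inf):
    """Affine map sending the interval [-t_inf, t_inf] onto [0, 1]."""
    return x / (2.0 * t_inf) + 0.5


def convert_snapshot(snapshot, t, rng):
    """Replace each letter i by a 0/1 sample that equals 1 w.p. phi(t[i])."""
    t_inf = max(abs(c) for c in t)  # assumes t is not the zero vector
    return [1 if rng.random() < phi(t[i], t_inf) else 0 for i in snapshot]


# Hypothetical direction over n = 3 letters and one 4-snapshot.
bits = convert_snapshot([0, 0, 1, 1], [0.2, -0.2, 0.0], random.Random(0))
```

In this toy example the first letter has bias 1 and the second has bias 0, so the converted snapshot is deterministic regardless of the seed.
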

We first need a bound on how good our estimate $\tilde\vartheta_t$ is.

Lemma 6.2. Using $(kL/\epsilon_2)^{O(k)}$ many $K$-snapshot samples, the above algorithm can produce, with probability 0.99, an estimate $\tilde\vartheta_t$ such that $\mathrm{Tran}(\vartheta_t, \tilde\vartheta_t) \le \epsilon_2$ for each $t \in T$.
Proof. Let $\vartheta'_t$ be the 1-dimensional measure supported on $[0,1]$ defined as $\vartheta'_t(S) = \vartheta_t(\phi^{-1}(S))$ for any $S \subseteq [0,1]$. A moment's reflection shows that $\vartheta'_t$ is exactly the mixture that generates the converted $K$-snapshot samples (i.e., the 0/1 samples generated in step 1). Let $\epsilon' = \epsilon_2/(2L)$. By Theorem 3.9, using $(k/\epsilon')^{O(k)}$ samples, the algorithm returns $\tilde\vartheta'_t$ with $\mathrm{Tran}(\vartheta'_t, \tilde\vartheta'_t) \le \epsilon'$. The function $\phi$ stretches the length by a factor of $1/(2\|t\|_\infty)$ (shifting by a constant does not affect transportation distance), so

$$\mathrm{Tran}(\vartheta_t, \tilde\vartheta_t) = \mathrm{Tran}(\vartheta'_t, \tilde\vartheta'_t) \cdot 2\|t\|_\infty \le 2\|t\|_\infty \epsilon' \le 2L\epsilon' = \epsilon_2. \;\blacksquare$$

6.2 Reconstructing from the 1D-projections

We use $\Pi_B$ as shorthand for $\Pi_{\mathrm{Span}(B)}$ and use $\vartheta_B$ to denote the projection of $\vartheta$ to $\mathrm{Span}(B)$, i.e., $\vartheta_B = \Pi_B(\vartheta)$. We now reconstruct $\vartheta_B$ from the 1-dimensional projections $\{\tilde\vartheta_t\}_{t \in T}$.

Now, we show how to obtain a probability measure $\tilde\vartheta_B$ such that $\mathrm{Tran}(\vartheta_B, \tilde\vartheta_B) \le O(\epsilon)$. Let $Sp = \mathrm{Span}(B) \cap B_2(L)$ where $B_2(L)$ is the $L_2$ ball in $\mathbb{R}^n$ with radius $L$. By Lemma 4.5 (iii), $\vartheta_B = \Pi_B(\vartheta)$ is supported on $Sp$. It is well known that there is an $\epsilon_2$-net $\mathcal{N}$ of size $(L/\epsilon_2)^{O(h)}$ for $Sp$ (see e.g., [21, 11]), i.e., for any point $p \in Sp$, there is a point $s \in \mathcal{N}$ such that $\|p - s\|_2 \le \epsilon_2$. Therefore, for any probability measure $\vartheta$ supported over $Sp$, there is a discrete distribution $Q$ with support $\mathcal{N}$ such that $\mathrm{Tran}_2(\vartheta, Q) \le \epsilon_2$. Now, we try to find a distribution $\tilde Q$ such that $\mathrm{Tran}(\tilde\vartheta_t, \tilde Q_t) \le \epsilon_2$ for each $t \in T$, where $\epsilon_2$ is defined in (8). Consider the following linear program (LP$_2$): For each point $q \in \mathcal{N}$, we have a variable $\mu_q$ ($\mu_q \ge 0$) corresponding to the probability mass at point $q$, and a variable $x_{pq} \ge 0$ representing the mass transported from a point $p \in \mathrm{Support}(\tilde\vartheta_t)$ to $q \in \mathcal{N}$. Note that $\tilde\vartheta_t$ is also a discrete distribution, so the constraint about the transportation distance can be encoded exactly as a linear program:

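LP$_2$: $\sum_{p} x_{pq} = \mu_q$ for all $q \in \mathcal{N}$;

$\sum_{q} x_{pq} = \tilde\vartheta_t(p)$ for all $p \in \mathrm{Support}(\tilde\vartheta_t)$;

$\sum_{p,q} |p - \langle q, t \rangle|\, x_{pq} \le \epsilon_2$; $\sum_{q} \mu_q = 1$.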

Suppose $Q$ is a discrete distribution with support $\mathcal{N}$ such that $\mathrm{Tran}_2(Q, \vartheta_B) \le \epsilon_2$. From Lemma 2.1, we can see that $\mathrm{Tran}(Q_t, \vartheta_t) \le \epsilon_2$ for all $t \in T$ as well ($\langle t, x \rangle$ for $\|t\|_2 \le 1$ is a contraction). Hence, LP$_2$ has a feasible solution. We obtain a feasible solution $\tilde Q$ to LP$_2$ and let $\tilde\vartheta_B = \tilde Q$ be our estimate of $\vartheta_B$.

Analysis. Any feasible solution $\tilde Q$ to LP$_2$ satisfies $\mathrm{Tran}(\tilde Q_t, \tilde\vartheta_t) \le \epsilon_2$. The following crucial lemma asserts that if the corresponding 1-dimensional projections of two measures are close in transportation distance for every direction, the original measures must be close too. Thus, we obtain that $\mathrm{Tran}_1(\vartheta_B, \tilde\vartheta_B) \le O(\epsilon_1)$; combining this with Theorem 4.1 yields Theorem 6.1.

Lemma 6.3. For any probability measure $P$ over $Sp$, we use $P_t$ to denote the 1-dimensional measure $P_t(S) := P(\{x \mid \langle t, x \rangle \in S\})$ for any $S \subseteq \mathbb{R}$. Consider two probability measures $P$ and $Q$ over $Sp$. If $\mathrm{Tran}(P_t, Q_t) \le O(\epsilon_2)$ for all $t \in T$, then $\mathrm{Tran}_1(P, Q) \le O(\epsilon_1)$.

Proof. Consider a function / that is supported on Sp = Span(H)nB 2 (L) and 1-Lip in Li distance (denoted 
as f £ l-Lip(Sp, Li)). From Lemma 4.5 (ii), we can see f{x) is -^-Lip in L 2 distance. Hence, f{xL), sup¬ 
ported on Sp = Span(i?)nB 2 (1), is 1-Lip in L 2 distance. From now on, let us switch to the representation in 
basis B for the rest of the proof. For any / £ l-Lip(Sp, Li), using Yudin’ Theorem (Theorem 2.7) and after 
scaling, we can see that there exist c(f') £ C for f' £ n such that \ f{x) — (UBf){x)\ < 0{h/R) 

where (/^/(x) = 
Now, fix some $t \in T$. In basis $B$, $t' = Rht$ is an integer vector. By the Kantorovich-Rubinstein theorem,
for any t £ T, we have | f gd(Pt — Qt)\ < aei for any g G a-Lip where a is a positive number. Consider 
function where i is the imaginary unit. It is easy to see both its real part and imaginary part are in a- Lip. 
Therefore, we have 


p--d{Pt-Qt) 


< 0 ( 0 ( 62 ). 


Now, we make a simple but crucial observation that links the projected measure Pt to the characteristic 
function of P: 

j any f G T and f' = Rht. 

In fact, this can be seen from (1), by viewing Pt as the image measure of P under the function (t, x). 

By the Kantorovich-Rubinstein theorem, Tran(P, Q) = supjgi.Ljp(Sp |/ fd{P — (5)|. Consider an 
arbitrary / G l-Lip(Sp, Li). We have that 


fdiP-Q) 




I U„fd(P-Q) 

< E +o(^) 




y; I<;(i)|. yeM'/idW-Q,) +o(|) 


< 


hRe2 


L 


ic(f')i+ o 


t'ez'*nB'*(ij) 


R 


Since |c(f')| < exp(0((i)), choosing R = O(^) and 62 = L, we have that |/ fd{P — Q)] < 

0(ei). Taking supremum on both sides completes the proof of the lemma. ■ 


Proof of Theorem 6.1. As noted earlier, any feasible solution $\tilde Q$ to LP$_2$ satisfies $\mathrm{Tran}(\tilde Q_t, \tilde\vartheta_t) \le \epsilon_2$. By Lemma 6.3 and noticing that

$$\vartheta_t = \Pi_t(\vartheta) = \Pi_t(\Pi_B \vartheta) = \Pi_t(\vartheta_B),$$

we can see that $\mathrm{Tran}_1(\tilde Q, \vartheta_B) \le O(\epsilon_1)$. Reduction 1 and Theorem 4.1 therefore show that we obtain $\tilde\vartheta$ satisfying the stated transportation-distance bound.

The sample size bounds for 1- and 2-snapshots follow from Lemma 4.2 (taking $\sigma = O(\epsilon)$) and Lemma 4.8 respectively. Overall, we need to estimate $R^{O(h)} = (h/\epsilon)^{O(h)}$ many $\vartheta_t$'s, each requiring $(k/\epsilon)^{O(k)}$ many $(2k - 1)$-snapshot samples (by Lemma 6.2). ■
A Proof of Lemma 2.1

Since $\mathrm{Tran}(\mu, \nu) \le \epsilon$, there exists a coupling $W$ between $\mu$ and $\nu$ such that

$$\int \|x - y\|_X \, d(W(x,y)) \le \epsilon.$$

$W$ can also be viewed as a coupling between $T\mu$ and $T\nu$. Therefore,

$$\mathrm{Tran}(T\mu, T\nu) \le \int \|Tx - Ty\|_Y \, d(W(x,y)) \le \int \|T\|_{X \to Y} \cdot \|x - y\|_X \, d(W(x,y)) \le \|T\|_{X \to Y}\, \epsilon.$$

The second statement can be shown in exactly the same way. The third is as simple. Suppose $W$ is the optimal coupling between $\mu$ and $\nu$. Then, we can see that

$$\mathrm{Tran}(T\mu, T'\nu) \le \int \|Tx - T'x'\|_Y \, W(d(x,x'))$$

$$\le \int \left( \|Tx - Tx'\|_Y + \|Tx' - T'x'\|_Y \right) W(d(x,x'))$$

$$\le \|T\|_{X \to Y} \int \|x - x'\|_X \, W(d(x,x')) + \|x'\|_X \int \|T - T'\|_{X \to Y} \, W(d(x,x'))$$

$$\le \|T\|_{X \to Y}\, \mathrm{Tran}(\mu, \nu) + \|x'\|_X \cdot \epsilon = O(\epsilon). \;\blacksquare$$

B Proofs from Section 4 

Proof of Lemma 43. With 0(^n log n) independent 1-snapshot samples, we can assume that iys satisfy 
the statement of Lemma 4.2. We modify the mixture as follows: If there is a letter i G [n] such that 
Xi < 2a!n, we simply eliminate this letter. The total probability of eliminated letters is at most 4a, which 
incurs at most an additive 4a < e term in transportation distance. For each of the remaining letter i € [n], 
we “split” it into rii = [nri/uj copies, and the probability of i is equally spit among these copies. For the 
eliminated letter i, we can think = 0. Let i? be the modified mixture. 

Consider an $m$-snapshot from the original mixture $\vartheta$. If the snapshot includes an eliminated letter, we ignore this snapshot. Otherwise, each letter $i$ in the snapshot is replaced with one of its $n_i$ copies, chosen uniformly at random. Then, we feed this snapshot to the algorithm for learning the modified mixture (we can easily see that the snapshot is distributed exactly the same as one generated from the modified mixture). Suppose the algorithm returns an estimate of the modified mixture. To obtain an estimate of the original mixture, for each constituent of the returned estimate, we have a constituent in which the probability of letter $i$ is the sum of the probabilities of the $n_i$ copies.

Now, we show d is isotropic. Let n' = < n/ahe. number of new letters in d. We can see 

n' > > 1^. For each non-eliminated item i, we have rj/rj G [31/32,33/32]. Then, we can easily 

verify that for each new item i!, we have f*/ = ^j- Therefore, d is isotropic. ■ 
Proof of Lemma 4.8. Let $D = N_2(\hat A - A)$. It is easy to see that $\mathrm{E}[D_{ij}] = 0$. Moreover, since $N_2$ is a Poisson random variable, the $D_{ij}$'s are independent of each other. Let $X_{ij}^{\ell} = 1$ if the $\ell$-th snapshot is $(i,j)$. Let $Y_{ij}^{\ell} = X_{ij}^{\ell} - A_{ij}$. So, $D_{ij} = \sum_{\ell} Y_{ij}^{\ell}$. We can see that

$$\mathrm{Var}[D_{ij} \mid N_2 = n_2] = n_2 \mathrm{Var}[Y_{ij}] \le n_2 A_{ij} \le n_2/n.$$

Let K = log n/e^). Using Bernstein’s inequality (Proposition 2.3), we can see that for any 

712 < 2E[A^’2], 

Prfl-DjJ > K \ N 2 = 712 ] < 2max/exp ( -^ ,exp(—SitT)! <1- 

I V ^2 / J exp(7i) 

With a union bound and the fact that Pr[7i2 > 2E[A^2]] < 1 — exp(—n) , we can see that with probability 
1 — exp(—Ti), \Dij\ < K for all i,j. Let £ denote the event \Dij\ < K for all i,j. Notice that conditioning 
on £, DijS are still independent of each other. Moreover, 

Var[Ai I £] = Var[Ai | lAjI < K] < VarfAj] = E[A^2] Var[yA] < E[W2]/n. 

Conditioning on £, we can apply Theorem 2.5 and obtain that, with probability 1 — 1/ poly(7i). 


Pll < 2Ul^i^ • V^ + 0(V^(E[W'2])^/^lnn) 

V TX 

Plugging in the value of E[A^ 2 ] and K, we can see that, with high probability 1 — 1/ poly(?T,), 

Proof of Lemma 4.9. Suppose the spectral decomposition of ^ is A = Yli=i ■ where Ai > A 2 > 

• • • > Afc > 0 are the eigenvalues. Let 7 = e^/kn. Suppose 7 < A^/ < A^^+i < ... < A^. It is easy to see 
that there must a value k' <j<k such that Aj_i — \j > ^/k. Define A' to be the truncation 


a ! AjUjuf. 

i:i<j 


First, we can see from the definition of A that for any i, 

Xi = {vi,Avi) = j {vi,x)^4){dx) = 


(uj, x)^7?(dx) 


Then, we have that 


J ||x — n^'x|||79(dx) < J^'^^{vi,x)vi 79(dx) 

i:i>j 

^ (ui, x)^79(dx) < f ^ {vi,x)^4){dx) 


i:i>j 
^ Aj < e^/n. 

i:Ai<7 


i:Ai<7 
Now, we can bound the transportation distance between $\vartheta$ and $\Pi_{A'}\vartheta$ using Cauchy-Schwarz, as follows:


Tran2(?9, < J \\x — 


1/2 


< ( / ||x — n^'x|||'i9(dx) J l'!?(dx) 

< Oie/V^). 

Suppose A has the spectral decomposition A = Yli=i A' = Yli i<j • Note that = 

IIj, (n^?9). Exactiy the same proof also shows that 

Tran2(n^??,n^,i9) < 0{e/^). 

All nonzero eigenvalues of A' and A' are at least e^/ [kn). Let be the matrix of canonical angles between 


Span(A') and Span(A'). Using Wedin’s Theorem (Theorem 2.4) and since ||A — A|| < O (^ ^^ 2 ^ 3/2 we 
can see that 

Tran2(nA''i?,n^,??) < j ||n^/(x) - n^,(x)||2??(dx) < j ||n^/- n^,|| • ||a;||2?9(dx) 

< J II sin4)||2??(dx) < II sin$||2 

<^<0(e/V^). 

Combining the above inequalities, we can see that 

Trani( 7 ?,-dj) < v^Tran 2 ('!?, < 0{e). 

To show that Trani(?9^, -i^q) < e, we can use exactly the same proof as that of Lemma 4.6. 